DATASET 4
DATASET 4 is a subset of DATASET 1, obtained by clustering the protein sequences by sequence similarity to remove close homologues, which may cause problems in certain applications. Clustering was performed with the CD-HIT Suite (Huang et al., 2010) at a sequence identity cut-off of 30%. The sequence with the highest number of variations was chosen to represent each cluster. This dataset also consists of a neutral and a pathogenic dataset.
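The representative-selection rule described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `pick_representatives`, the two input mappings, and the tie-breaking rule (smallest identifier wins) are assumptions made for illustration, and the sequence identifiers and counts are made up.

```python
from collections import defaultdict


def pick_representatives(cluster_of, variation_count):
    """Return {cluster_id: sequence_id}, choosing per cluster the sequence
    with the highest number of variations.

    cluster_of      -- hypothetical mapping: sequence id -> cluster id
    variation_count -- hypothetical mapping: sequence id -> number of variations
    """
    # Group sequence ids by the cluster they were assigned to.
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    # max() keeps the first maximal item, so sorting first makes ties
    # resolve to the smallest id (an assumption; the source states no tie rule).
    return {
        cluster_id: max(sorted(seq_ids), key=lambda s: variation_count.get(s, 0))
        for cluster_id, seq_ids in members.items()
    }


# Made-up example: two clusters, three sequences.
cluster_of = {"SEQ_A": 0, "SEQ_B": 0, "SEQ_C": 1}
variation_count = {"SEQ_A": 3, "SEQ_B": 7, "SEQ_C": 1}
representatives = pick_representatives(cluster_of, variation_count)
```

Here `SEQ_B` would represent cluster 0 because it carries more variations than `SEQ_A`, and `SEQ_C` represents cluster 1 as its only member.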
Clustered pathogenic single nucleotide polymorphisms from Dataset1
This dataset contains substitutions in 954 representative sequences (clusters).
Download: Clustered pathogenic dataset (Thusberg et al., 2011)
Download: Clustered pathogenic dataset annotated with VariO**
Clustered neutral single nucleotide variations from Dataset1
This is the negative (neutral) dataset, comprising 15,721 human non-synonymous coding SNVs in 6,045 representative sequences (clusters) from the dbSNP database, build 131.
Download: Clustered neutral dataset (Thusberg et al., 2011)*
Download: Clustered neutral dataset annotated with VariO**
References:
Thusberg J, Olatubosun A, Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat. 2011, 32(4):358-368.
Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26:680-682.
* Last updated: 2013-01-02.
** Tab-delimited file, updated: 2013-11-12.
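Because the VariO-annotated files are described as tab-delimited, they can be read with Python's standard `csv` module. The sketch below is only an illustration: the column layout of the actual files is not given here, so the helper name `read_tab_delimited` and the sample rows are hypothetical.

```python
import csv
import io


def read_tab_delimited(handle):
    """Yield each non-empty row of a tab-delimited text stream as a list of strings."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if row:  # csv.reader returns [] for blank lines; skip them
            yield row


# Hypothetical sample content standing in for a downloaded file.
sample = io.StringIO("SEQ_A\tA123T\tsubstitution\n\nSEQ_B\tG45D\tsubstitution\n")
rows = list(read_tab_delimited(sample))
```

For a real download, the same function could be given a file opened with `open(path, newline="", encoding="utf-8")`, where `path` and the encoding are assumptions about the local copy.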