DATASET 1
This is the neutral dataset, comprising 21,170 human nonsynonymous coding SNVs with allele frequency > 0.01 and chromosome sample count >= 50 from dbSNP build 131. This set was used for training PON-P.
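The two selection thresholds stated above can be sketched as a simple filter. This is only an illustration of the stated criteria, not the procedure actually used to build the dataset; the record type `SnvRecord`, its field names, and the example identifiers are hypothetical and do not reflect the dbSNP file layout.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SnvRecord:
    """Hypothetical minimal record; field names are assumptions."""
    variant_id: str
    allele_frequency: float
    chrom_sample_count: int


# Thresholds taken from the dataset description.
MIN_ALLELE_FREQUENCY = 0.01   # strict: frequency must be > 0.01
MIN_CHROM_SAMPLE_COUNT = 50   # inclusive: count must be >= 50


def passes_neutral_filter(rec: SnvRecord) -> bool:
    """Return True if the record meets both stated thresholds."""
    return (rec.allele_frequency > MIN_ALLELE_FREQUENCY
            and rec.chrom_sample_count >= MIN_CHROM_SAMPLE_COUNT)


def select_neutral(records: Iterable[SnvRecord]) -> List[SnvRecord]:
    """Keep only the records that pass the filter, preserving order."""
    return [rec for rec in records if passes_neutral_filter(rec)]


if __name__ == "__main__":
    # Made-up example records, for illustration only.
    examples = [
        SnvRecord("example_a", 0.25, 120),   # kept
        SnvRecord("example_b", 0.01, 120),   # dropped: frequency not > 0.01
        SnvRecord("example_c", 0.30, 49),    # dropped: too few chromosomes
        SnvRecord("example_d", 0.02, 50),    # kept: count of 50 is allowed
    ]
    for rec in select_neutral(examples):
        print(rec.variant_id)
```

Note that the frequency bound is strict while the sample-count bound is inclusive, as written in the description.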
DATASET 2
This is a subset of DATASET 1 from which cancer cases were removed. It contains both neutral and pathogenic variants.
DATASET 3
Amino acid substitutions annotated to affect protein activity were collected from the Protein Mutant Database (PMD). This set was used for testing PON-P.
DATASET 4
This is a subset of DATASET 1 obtained by clustering the protein sequences based on their sequence similarity to remove close homologues.
DATASET 5
This is a subset of DATASET 2 obtained by clustering the protein sequences based on their sequence similarity to remove close homologues.
DATASET 6
This is a subset of DATASET 3 extracted by clustering the protein sequences based on their sequence similarity.
DATASET 7
This is a subset of the DATASET 2 filtered by the availability of features used in PON-P2. This dataset was used for training and testing PON-P2.
DATASET 8
The dataset was developed and used for the evaluation of prediction tools and for training of the consensus classifier PredictSNP.
DATASET 9
Protein-specific and general pathogenicity predictors for amino acid substitutions
Reference: Riera C, Padilla N and de la Cruz X, 2016. The Complementarity Between Protein-Specific and General Pathogenicity Predictors for Amino Acid Substitutions. Hum Mutat 37:1013–1024. PUBMED
DATASET 10
F1 contains 69141 SNVs from the Inherited Disease (weighted) SwissProt/TrEMBL dataset from humsavar. F2 contains 94995 cancer-associated pathogenic training cases. F3 contains 69141 disease-specific SwissProt/TrEMBL (2014_05) humsavar variations.
Reference: Rogers, M. F., Shihab, H. A., Mort, M., Cooper, D. N., Gaunt, T. R., & Campbell, C. (2017). FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics (Oxford, England), 34(3), 511-513. PUBMED
DATASET 11
2600 disease-causing or benign variants from ClinVar and the 1000 Genomes Project (each of the 3 possible genotypes found in at least 50 samples). F2 contains 2200 disease-causing and benign variants from the 1000 Genomes Project. F3 contains 1100 pathogenic variants. F4 contains 1100 benign variants.
Reference: Schwarz, M.J., Cooper, D.N., Schuelke, M., Seelow, D., MutationTaster2: mutation prediction for the deep-sequencing age, Nature Methods, 11(361), 2014/03/28/online. PUBMED
DATASET 12
Dataset F1 includes 9,477 variants (5,740 deleterious, 3,737 neutral). F2 contains 1,542 human variants not included in HumDiv. F3 contains a subset of 383 variants found naturally in the human population. F4 contains 949 variants as a subset of the Human Outgroup set that includes only variants that have been identified in the human population. F5 contains 4992 variants in non-human proteins. F6 contains 6555 ClinVar variants with reliable structural models.
Reference: Baugh, E. H., Simmons-Edler, R., Müller, C. L., Alford, R. F., Volfovsky, N., Lash, A. E., & Bonneau, R. (2016). Robust classification of protein variation using structural modelling and large-scale data integration. Nucleic acids research, 44(6), 2501-13. PUBMED
DATASET 13
Dataset of 33483 positive and negative nonsynonymous SNVs.
Reference: Korvigo, I., Afanasyev, A., Romashchenko, N., & Skoblov, M. (2018). Generalising better: Applying deep learning to integrate deleteriousness prediction scores for whole-exome SNV studies. PloS one, 13(3), e0192829. doi:10.1371/journal.pone.0192829. PUBMED
DATASET 14
Variants from five datasets: HumVar, ExoVar, VariBenchSelected, predictSNPSelected and SwissVarSelected. 21946 single amino acid variants that have been structurally characterized.
Reference: L Ponzoni, I Bahar, Structural dynamics is a determinant of the functional significance of missense variants, Proceedings of the National Academy of Sciences Apr 2018, 115 (16) 4164-4169; DOI: 10.1073/pnas.1715896115. PUBMED
DATASET 15
CADD training data of 16 627 775 ‘observed’ and 49 407 057 ‘simulated’ variants.
Reference: Quang, D., Chen, Y., & Xie, X. (2014). DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics (Oxford, England), 31(5), 761-3. PUBMED
DATASET 16
Disease-associated single amino acid substitutions. F1 contains 876 proteins with 3257 disease-associated and 2118 benign variations. F2 is an independent dataset, which consisted of 218 proteins with 696 disease-associated and 456 benign variations.
Reference: Li, Y., Wen, Z., Xiao, J., Yin, H., Yu, L., Yang, L., & Li, M. (2011). Predicting disease-associated substitution of a single amino acid by analyzing residue interactions. BMC bioinformatics, 12, 14. doi:10.1186/1471-2105-12-14. PUBMED
DATASET 17
F1 contains 18633 variants from VariBench and F2 contains 64163 SAVs from Humsavar.
Reference: Yates, C. M., Filippis, I., Kelley, L. A., & Sternberg, M. J. (2014). SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features. Journal of molecular biology, 426(14), 2692-701. PUBMED
DATASET 18
64 nonsynonymous SNVs from the Centers for Mendelian Genomics (CMG), 158 variants from the Deciphering Developmental Disorders Study (DDDS), 15702 and 3562 nonsynonymous disease-causing variants from EXOVAR and ClinVar, respectively, 512370 variants from the 1000 Genomes Project, 51599 segmentally duplicated regions from hg19, 11763 nonsynonymous changes based on GENCODE 19, and 1048544 variants in the ESP6500 dataset.
Reference: Gosalia, N., Economides, A. N., Dewey, F. E., & Balasubramanian, S. (2017). MAPPIN: a method for annotating, predicting pathogenicity and mode of inheritance for nonsynonymous variants. Nucleic acids research, 45(18), 10393-10402. PUBMED
DATASET 19
Dataset F1 contains 14 191 Mendelian disease-causing variations and 22 001 neutral variations. The test datasets F2, F3, F4, F5 and F6 contain 88184 variations in total.
Reference: Dong, C., Wei, P., Jian, X., Gibbs, R., Boerwinkle, E., Wang, K., & Liu, X. (2014). Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Human molecular genetics, 24(8), 2125-37. PUBMED
DATASET 20
F1 contains the training dataset of 48534 variants and F2 contains the test dataset of 1408 variants, both from the ClinVar database.
Reference: Capriotti, E., & Fariselli, P. (2017). PhD-SNPg: a webserver and lightweight tool for scoring single nucleotide variants. Nucleic acids research, 45(W1), W247-W252. PUBMED
DATASET 21
1161 variants for multigene panel test (MGPT), for BRCA1, BRCA2, CDH1, PALB2, PTEN, TP53, MLH1, MSH2, MSH6 and PMS2 from ClinVar.
Reference: Qian D, Li S, Tian Y, Clifford JW, Sarver BAJ, et al. (2018) A Bayesian framework for efficient and accurate variant prediction. PLOS ONE 13(9): e0203553. https://doi.org/10.1371/journal.pone.0203553. PUBMED
DATASET 22
F1 contains 337 variants distributed across 43 ADME genes and F2 contains 180 loss-of-function and neutral variants.
Reference: Zhou Y., Mkrtchian S., Kumondai M., Hiratsuka M., Lauschke V.M. (2018), An optimized prediction framework to assess the functional impact of pharmacogenetic variants, The Pharmacogenomics Journal, pp. 1473-1150, doi: 10.1038/s41397-018-0044-2. PUBMED
DATASET 23
F1 contains 25480 nucleotide variants for Mendelian diseases. F2 contains 12050 nucleotide variants for complex diseases. F3 contains 142722 nucleotide variants for somatic cancers. F4 contains 16716 amino acid variants for Mendelian diseases. F5 contains 71674 amino acid variants for somatic cancers.
Reference: Bendl J, Musil M, Štourač J, Zendulka J, Damborský J, Brezovský J (2016) PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions. PLoS Comput Biol 12(5): e1004962. doi:10.1371/journal.pcbi.1004962. PUBMED
DATASET 24
F1 contains 45573 variants of all species used for training. F2 contains 306 variants from animals used for training. F3 contains 5360 variants of all species used as a blind dataset. F4 contains 324 variants from animals used as a blind dataset. F5 contains 3836 human variants used as a blind dataset. F6 contains 1109 plant variants used as a blind dataset. F7 contains 48176 human variants used as a training dataset. F8 contains 4154 plant variants used as a training dataset.
Reference: Y. Yang, A. Shao, M. Vihinen, PON-All: Amino Acid Substitution Tolerance Predictor for All Organisms, Front. Mol. Biosci., 16 June 2022, https://doi.org/10.3389/fmolb.2022.867572. PUBMED
DATASET 25
The file contains three sheets: 1) a mouse dataset of 377 variants (189 deleterious, 188 neutral), 2) a dog dataset of 207 variants (103 deleterious, 104 neutral), and 3) a cattle dataset of 62 variants (30 deleterious, 32 neutral).
Reference: Plekhanova, E., Nuzhdin, S.V., Utkin, L.V., Samsonova, M.G. Prediction of deleterious mutations in coding regions of mammals with transfer learning. Wiley PMID: 30622632 PMCID: PMC6304693 DOI: 10.1111/eva.12607. PUBMED
DATASET 26
The file contains 2,617 amino acid altering mutations in 960 A. thaliana genes.
Reference: Kono, T.J.Y., Lei, L., Shih, C.H., Hoffman, P.J., Morrell, P.L., and Fay, J.C. (2018). Comparative genomics approaches accurately predict deleterious variants in plants. G3 (Bethesda) 8, 3321-3329. PUBMED
DATASET 27
F contains 4409 variants
Reference: Kovalev, M.S., Igolkina, A.A., Samsonova, M.G., and Nuzhdin, S.V. (2018). A pipeline for classifying deleterious coding mutations in agricultural plants. Front Plant Sci 9, 1734. PUBMED
DATASET 28
Reference: Pejaver, V., Urresti, J., Lugo-Martinez, J., Pagel, K.A., Lin, G.N., Nam, H.J., Mort, M., Cooper, D.N., Sebat, J., Iakoucheva, L.M., et al. (2020). Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat Commun 11, 5918. PUBMED
DATASET 29
F1 contains 43000 pathogenic and F2 contains 43000 benign data from the gnomAD database.
Reference: J.Pei, L.N Kinch, Z.Otwinowski, N.V Grishin (2020). Mutation severity spectrum of rare alleles in the human genome is predictive of disease type. PLoS Comput Biol, 16. PUBMED
DATASET 30
F1 contains rare (MAF < 0.5%) ClinVar variants in the core set. F2 contains extremely rare (MAF < 10^-6) ClinVar variants.
Reference: Y.Wu, R.Li, S.Sun, J.Weile, F.P Roth, Improved pathogenicity prediction for rare human missense variants, Am J Hum Genet;108(10):1891-1906. doi: 10.1016/j.ajhg.2021.08.012. PUBMED
DATASET 31
Reference: Mathieu Quinodoz, Virginie G Peter, Katarina Cisarova, Beryl Royer-Bertrand, Peter D Stenson, David N Cooper, Sheila Unger, Andrea Superti-Furga, Carlo Rivolta, Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity, Am J Hum Genet. 2022 Mar 3;109(3):457-470. doi: 10.1016/j.ajhg.2022.01.006. PUBMED
DATASET 32
Reference: Jiang, T., Wang, K., Fang, L., MutFormer: A context-dependent transformer-based model to predict pathogenic missense mutations, https://doi.org/10.48550/arXiv.2110.14746 PUBMED
Last updated: 2022-06-28 by Niloofar Shirvanizadeh.