VariBench

A benchmark database for variations




i. Training datasets

Variations affecting protein tolerance/pathogenicity

DATASET 1

This is the neutral dataset of 21,170 human nonsynonymous coding SNVs with allele frequency >0.01 and chromosome sample count >= 50 from dbSNP build 131. This set was used for training PON-P.

    DATASET 1

DATASET 2

This is a subset of DATASET 1 from which cancer cases were removed. It contains both neutral and pathogenic variants.

    DATASET 2

DATASET 3

Amino acid substitutions annotated to affect protein activity were collected from the Protein Mutant Database (PMD). This set was used for testing PON-P.

    DATASET 3

DATASET 4

This is a subset of DATASET 1 obtained by clustering the protein sequences by sequence similarity to remove close homologues.

    DATASET 4
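DATASETS 4 to 6 are described as derived by clustering protein sequences on sequence similarity, but the page does not name the clustering tool, the similarity measure, or the cutoff. The following is only a minimal sketch of one common approach (greedy clustering around representative sequences). The `similarity` function, the `threshold` value, and the example sequences are all assumptions made for illustration, and `difflib` is a stand-in for a proper alignment-based identity measure.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs: dict, threshold: float = 0.8) -> dict:
    """Group sequences around representatives, longest sequences first.

    Returns {representative_name: [member_names]}. Keeping only the
    representatives gives a set without close homologues (hypothetical cutoff).
    """
    clusters = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if similarity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)  # close homologue of an existing representative
                break
        else:
            clusters[name] = [name]  # no similar representative found: start a new cluster
    return clusters


# Made-up sequences: P1 and P2 differ at one position, P3 is unrelated.
example = {
    "P1": "MKTAYIAKQRQISFVKSHFSRQ",
    "P2": "MKTAYIAKQRQISFVKSHFSRA",
    "P3": "GGGGGGGGGGWWWWWWWW",
}
clusters = greedy_cluster(example)
```

Variants would then be retained only for proteins that are cluster representatives, which is one way to avoid near-identical proteins appearing in both training and test data.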

DATASET 5

This is a subset of DATASET 2 obtained by clustering the protein sequences by sequence similarity to remove close homologues.

    DATASET 5

DATASET 6

This is a subset of DATASET 3 extracted by clustering the protein sequences by sequence similarity.

    DATASET 6

DATASET 7

PON-P2 dataset

This is a subset of the DATASET 2 filtered by the availability of features used in PON-P2. This dataset was used for training and testing PON-P2.

    DATASET 7

DATASET 8

PredictSNP dataset

The dataset was developed and used for the evaluation of prediction tools and for training of the consensus classifier PredictSNP.

    DATASET 8

DATASET 9

Dataset used by Riera et al.

Protein-specific and general pathogenicity predictors for amino acid substitutions

    DATASET 9

Reference: Riera C, Padilla N and de la Cruz X, 2016. The Complementarity Between Protein-Specific and General Pathogenicity Predictors for Amino Acid Substitutions. Hum Mutat 37:1013–1024.   PUBMED  

DATASET 10

Dataset used for FATHMM-XF

F1 contains 69141 SNVs from the Inherited Disease (weighted) SwissProt/TrEMBL dataset from humsavar. F2 contains 94995 cancer-associated pathogenic training cases. F3 contains 69141 disease-specific SwissProt/TrEMBL (2014_05) humsavar variations.

        F1     F2     F3

Reference: Rogers, M. F., Shihab, H. A., Mort, M., Cooper, D. N., Gaunt, T. R., & Campbell, C. (2017). FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics (Oxford, England), 34(3), 511-513.  PUBMED  

DATASET 11

Dataset used for MutationTaster2

F1 contains 2600 disease-causing or benign variants from ClinVar and the 1000 Genomes Project (each of the 3 possible genotypes found in at least 50 samples). F2 contains 2200 disease-causing and benign variants from the 1000 Genomes Project. F3 contains 1100 pathogenic variants. F4 contains 1100 benign variants.

    F1     F2     F3     F4

Reference: Schwarz, J.M., Cooper, D.N., Schuelke, M., Seelow, D. (2014). MutationTaster2: mutation prediction for the deep-sequencing age. Nature Methods, 11, 361 (published online 2014-03-28).  PUBMED  

DATASET 12

Dataset used for VIPUR

Dataset F1 includes 9,477 variants (5,740 deleterious, 3737 neutral). F2 contains 1,542 human variants not included in HumDiv. F3 contains a subset of 383 variants found naturally in the human population. F4 contains 949 variants as a subset of the Human Outgroup set that includes only variants that have been identified in the human population. F5 contains 4992 variants in non-human proteins. F6 contains 6555 ClinVar variants with reliable structural models.

    F1     F2     F3     F4     F5     F6

Reference: Baugh, E. H., Simmons-Edler, R., Müller, C. L., Alford, R. F., Volfovsky, N., Lash, A. E., & Bonneau, R. (2016). Robust classification of protein variation using structural modelling and large-scale data integration. Nucleic acids research, 44(6), 2501-13.   PUBMED  

DATASET 13

Dataset used for BadMut 

Dataset of 33483 positive and negative nonsynonymous SNVs.

    F

Reference: Korvigo, I., Afanasyev, A., Romashchenko, N., & Skoblov, M. (2018). Generalising better: Applying deep learning to integrate deleteriousness prediction scores for whole-exome SNV studies. PloS one, 13(3), e0192829. doi:10.1371/journal.pone.0192829.  PUBMED  

DATASET 14

Dataset used for RAPSODY 

Variants from five datasets: HumVar, ExoVar, VariBenchSelected, predictSNPSelected and SwissVarSelected. 21946 single amino acid variants that have been structurally characterized.

    F

Reference: L Ponzoni, I Bahar, Structural dynamics is a determinant of the functional significance of missense variants, Proceedings of the National Academy of Sciences Apr 2018, 115 (16) 4164-4169; DOI: 10.1073/pnas.1715896115.  PUBMED  

DATASET 15

Dataset used for DANN 

CADD training data of 16 627 775 ‘observed’ and 49 407 057 ‘simulated’ variants.

  1. Training dataset : https://cbcl.ics.uci.edu/public_data/DANN/data/training.X.npz
  2. Validation dataset: https://cbcl.ics.uci.edu/public_data/DANN/data/validation.X.npz
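Because the DANN files are distributed as `.npz` archives and the training set is large, it can help to look inside an archive before loading it. An `.npz` file is a ZIP container of `.npy` members, so the standard library is enough to list them. The sketch below assumes the file has already been downloaded; the local path `training.X.npz` in the usage comment is an assumption, and the page does not say which arrays the archive holds.

```python
import zipfile


def list_npz_members(path):
    """Return (member_name, uncompressed_bytes) pairs for an .npz archive.

    Only the ZIP directory is read, so no array data is loaded into memory.
    """
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Hypothetical usage after downloading the training file:
# for name, size in list_npz_members("training.X.npz"):
#     print(name, size)
```

Once the member names are known, the arrays themselves can be read with a library such as NumPy.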

Reference: Quang, D., Chen, Y., & Xie, X. (2014). DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics (Oxford, England), 31(5), 761-3.  PUBMED  

DATASET 16

Dataset used for SAPs

Disease-associated single amino acid substitutions. F1 contains 876 proteins with 3257 disease-associated and 2118 benign variations. F2 is an independent dataset, which consisted of 218 proteins with 696 disease-associated and 456 benign variations.

    F1     F2

Reference: Li, Y., Wen, Z., Xiao, J., Yin, H., Yu, L., Yang, L., & Li, M. (2011). Predicting disease-associated substitution of a single amino acid by analyzing residue interactions. BMC bioinformatics, 12, 14. doi:10.1186/1471-2105-12-14.   PUBMED  

DATASET 17

Dataset used for SuSPect  

F1 contains 18633 variants from VariBench and F2 contains 64163 SAVs from Humsavar.

    F1     F2

Reference: Yates, C. M., Filippis, I., Kelley, L. A., & Sternberg, M. J. (2014). SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features. Journal of molecular biology, 426(14), 2692-701.   PUBMED  

DATASET 18

Dataset used for MAPPIN

The dataset comprises 64 nsSNVs from the Centers for Mendelian Genomics (CMG); 158 variants from the Deciphering Developmental Disorders Study (DDDS); 15702 EXOVAR and 3562 ClinVar nonsynonymous disease-causing variants; 512370 variants from the 1000 Genomes Project; 51599 segmentally duplicated regions from hg19; 11763 nonsynonymous changes based on GENCODE 19; and 1048544 variants in the ESP6500 dataset.

    F1     F2     F3     F4     F5     F6     F7     F8

Reference: Gosalia, N., Economides, A. N., Dewey, F. E., & Balasubramanian, S. (2017). MAPPIN: a method for annotating, predicting pathogenicity and mode of inheritance for nonsynonymous variants. Nucleic acids research, 45(18), 10393-10402.  PUBMED  

DATASET 19

Dataset used for comparing deleteriousness-scoring methods

F1 contains 14 191 Mendelian disease-causing variations and 22 001 neutral variations. The test datasets F2, F3, F4, F5 and F6 contain 88184 variations in total.

    F1     F2     F3     F4     F5     F6

Reference: Dong, C., Wei, P., Jian, X., Gibbs, R., Boerwinkle, E., Wang, K., & Liu, X. (2014). Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Human molecular genetics, 24(8), 2125-37.   PUBMED  

DATASET 20

Dataset used for PhD-SNPg

F1 contains 48534 training variants and F2 contains 1408 test variants from the ClinVar database.

    F1     F2

Reference: Capriotti, E., & Fariselli, P. (2017). PhD-SNPg: a webserver and lightweight tool for scoring single nucleotide variants. Nucleic acids research, 45(W1), W247-W252.   PUBMED  

DATASET 21

Dataset used for MVP

1161 ClinVar variants for a multigene panel test (MGPT) covering BRCA1, BRCA2, CDH1, PALB2, PTEN, TP53, MLH1, MSH2, MSH6 and PMS2.

    F

Reference: Qian D, Li S, Tian Y, Clifford JW, Sarver BAJ, et al. (2018) A Bayesian framework for efficient and accurate variant prediction. PLOS ONE 13(9): e0203553. https://doi.org/10.1371/journal.pone.0203553.   PUBMED  

DATASET 22

Dataset used for drug absorption, distribution, metabolism and excretion (ADME) study

F1 contains 337 variants distributed across 43 ADME genes and F2 contains 180 loss-of-function and neutral variants.

    F1     F2     F3

Reference: Zhou, Y., Mkrtchian, S., Kumondai, M., Hiratsuka, M., & Lauschke, V. M. (2018). An optimized prediction framework to assess the functional impact of pharmacogenetic variants. The Pharmacogenomics Journal, ISSN 1473-1150. doi:10.1038/s41397-018-0044-2.   PUBMED  

DATASET 23

Dataset used for PredictSNP2

F1 contains 25480 nucleotide variants for Mendelian diseases. F2 contains 12050 nucleotide variants for complex diseases. F3 contains 142722 nucleotide variants for somatic cancers. F4 contains 16716 amino acid variants for Mendelian diseases. F5 contains 71674 amino acid variants for somatic cancers.

    F1     F2     F3     F4     F5

Reference: Bendl J, Musil M, Štourač J, Zendulka J, Damborský J, Brezovský J (2016) PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions. PLoS Comput Biol 12(5): e1004962. doi:10.1371/journal.pcbi.1004962.   PUBMED  

DATASET 24

Dataset for PON-All

F1 contains 45573 variants of all species used for training. F2 contains 306 variants from animals used for training. F3 contains 5360 variants of all species used as a blind dataset. F4 contains 324 variants from animals used as a blind dataset. F5 contains 3836 human variants used as a blind dataset. F6 contains 1109 plant variants used as a blind dataset. F7 contains 48176 human variants used as a training dataset. F8 contains 4154 plant variants used as a training dataset.

    F1     F2     F3     F4     F5     F6     F7     F8

Reference: Y. Yang, A. Shao, M. Vihinen, PON-All: Amino Acid Substitution Tolerance Predictor for All Organisms, Front. Mol. Biosci., 16 June 2022. https://doi.org/10.3389/fmolb.2022.867572   PUBMED  

DATASET 25

Mammalian disease variants

The file contains three sheets: 1) a mouse dataset of 377 variants (189 deleterious, 188 neutral), 2) a dog dataset of 207 variants (103 deleterious, 104 neutral), and 3) a cattle dataset of 62 variants (30 deleterious, 32 neutral).

    F

Reference: Plekhanova, E., Nuzhdin, S.V., Utkin, L.V., Samsonova, M.G. Prediction of deleterious mutations in coding regions of mammals with transfer learning. Evolutionary Applications (Wiley). PMID: 30622632; PMCID: PMC6304693; DOI: 10.1111/eva.12607.   PUBMED  
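The per-sheet counts given for DATASET 25 can be checked for internal consistency and class balance with a few lines of arithmetic. The sketch below uses only the numbers stated in the description; the dictionary layout and the function name `class_balance` are illustrative choices, not part of the dataset.

```python
# (deleterious, neutral) counts per sheet, as stated in the DATASET 25 description.
SHEETS = {
    "mouse": (189, 188),
    "dog": (103, 104),
    "cattle": (30, 32),
}


def class_balance(deleterious, neutral):
    """Return (total variants, fraction deleterious rounded to 3 decimals)."""
    total = deleterious + neutral
    return total, round(deleterious / total, 3)


# Totals come out as 377, 207 and 62, matching the stated sheet sizes.
summary = {species: class_balance(*counts) for species, counts in SHEETS.items()}
```

All three sheets are close to a 50/50 split, so plain accuracy is not badly skewed by class imbalance on this dataset.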

DATASET 26

Arabidopsis thaliana variants

The file contains 2,617 amino acid altering mutations in 960 A. thaliana genes.

    F

Reference: Kono, T.J.Y., Lei, L., Shih, C.H., Hoffman, P.J., Morrell, P.L., and Fay, J.C. (2018). Comparative genomics approaches accurately predict deleterious variants in plants. G3 (Bethesda) 8, 3321-3329.   PUBMED  

DATASET 27

Arabidopsis data set

F contains 4409 variants

    F

Reference: Kovalev, M.S., Igolkina, A.A., Samsonova, M.G., and Nuzhdin, S.V. (2018). A pipeline for classifying deleterious coding mutations in agricultural plants. Front Plant Sci 9, 1734.  PUBMED  

DATASET 28

Dataset for MutPred2

    F

Reference: Pejaver, V., Urresti, J., Lugo-Martinez, J., Pagel, K.A., Lin, G.N., Nam, H.J., Mort, M., Cooper, D.N., Sebat, J., Iakoucheva, L.M., et al. (2020). Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat Commun 11, 5918.   PUBMED  

DATASET 29

Dataset for DeepSav

F1 contains 43000 pathogenic variants and F2 contains 43000 benign variants from the gnomAD database.

    F1     F2

Reference: J. Pei, L.N. Kinch, Z. Otwinowski, N.V. Grishin (2020). Mutation severity spectrum of rare alleles in the human genome is predictive of disease type. PLoS Comput Biol, 16.   PUBMED  

DATASET 30

Dataset for VARITY

F1 contains rare (MAF < 0.5%) ClinVar44 variants in the core set. F2 contains extremely rare (MAF < 10^-6) ClinVar44 variants.

    F1     F2

Reference: Y. Wu, R. Li, S. Sun, J. Weile, F.P. Roth, Improved pathogenicity prediction for rare human missense variants, Am J Hum Genet. 2021;108(10):1891-1906. doi: 10.1016/j.ajhg.2021.08.012.   PUBMED  

DATASET 31

Dataset for MutScore

    F

Reference: Mathieu Quinodoz, Virginie G Peter, Katarina Cisarova, Beryl Royer-Bertrand, Peter D Stenson, David N Cooper, Sheila Unger, Andrea Superti-Furga, Carlo Rivolta, Analysis of missense variants in the human genome reveals widespread gene-specific clustering and improves prediction of pathogenicity, Am J Hum Genet. 2022 Mar 3;109(3):457-470. doi: 10.1016/j.ajhg.2022.01.006.   PUBMED  

DATASET 32

Dataset for MutFormer

    F

Reference: Jiang, T., Wang, K., Fang, L., MutFormer: A context-dependent transformer-based model to predict pathogenic missense mutations, https://doi.org/10.48550/arXiv.2110.14746  PUBMED  


Last updated: 2022-06-28 by Niloofar Shirvanizadeh.