VariBench

A benchmark database for variations



Molecule-specific datasets

MMR missense variants

DATASET 1

Dataset used for PON-MMR

168 experimentally verified mismatch repair system (MMR) amino acid substitutions with known functional effect from the literature. There are

  1. 88 variations in the neutral dataset. Neutral_Variants
  2. 80 variations in the pathogenic dataset. Pathogenic_Variants

Reference: Ali H, Olatubosun A, Vihinen M. Classification of mismatch repair gene missense variants with PON-MMR. Hum Mutat. 2012 Apr; 33(4):642-50. doi: 10.1002/humu.22038.  PUBMED  

DATASET 2

Dataset used for PON-MMR2

224 validated MMR amino acid substitutions with known functional effect from the InSiGHT database. There are

  1. 178 (89 pathogenic and 89 neutral) variations in the training dataset. Training dataset
  2. 46 (32 pathogenic and 14 neutral) variations in the test dataset. Test dataset
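The class balance of the two subsets differs, which matters when interpreting accuracy on the test set. The following is a minimal sketch, using only the counts stated above, that computes the subset sizes and the accuracy a trivial majority-class predictor would reach. The names `PON_MMR2` and `class_summary` are illustrative and are not part of VariBench or PON-MMR2.

```python
# Counts copied from the dataset description above (not read from the files).
PON_MMR2 = {
    "training": {"pathogenic": 89, "neutral": 89},
    "test": {"pathogenic": 32, "neutral": 14},
}


def class_summary(counts):
    """Return subset size, per-class fractions, and the majority-class baseline."""
    total = sum(counts.values())
    fractions = {label: n / total for label, n in counts.items()}
    return {
        "total": total,
        "fractions": fractions,
        # Accuracy obtained by always predicting the most frequent class.
        "majority_baseline": max(fractions.values()),
    }


summaries = {name: class_summary(counts) for name, counts in PON_MMR2.items()}
overall = sum(s["total"] for s in summaries.values())

for name, s in summaries.items():
    print(name, s["total"], round(s["majority_baseline"], 3))
print("all variants:", overall)
```

The training set is balanced, so its baseline is 0.5, whereas always predicting "pathogenic" on the test set would already be correct for 32 of 46 variants. Measures that account for class imbalance are therefore more informative than raw accuracy on the test subset.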

Reference: Niroula A, Vihinen M. 2015. Classification of amino acid substitutions in mismatch repair proteins using PON-MMR2. Hum Mutat 36(12):1128-1134  PUBMED  

DATASET 3

Dataset used for PON-mt-tRNA

146 single nucleotide substitutions in human mitochondrial tRNAs. There are 91 pathogenic and 55 neutral variations in the dataset.

    PON-mt-tRNA training and test dataset

Reference: Niroula A, Vihinen M. 2016. PON-mt-tRNA: a multifactorial probability-based method for classification of mitochondrial tRNA variations. Nucleic Acids Res 44(5):2020-2027.  PUBMED  

DATASET 4

Dataset used for PON-BTK

152 disease (XLA)-associated amino acid substitutions caused by single nucleotide changes (SNAVs) in 91 residues.

Reference: Valiaho, J. , Faisal, I. , Ortutay, C. , Smith, C. I. and Vihinen, M. (2015), Characterization of all Possible Single-Nucleotide Change Caused Amino Acid Substitutions in the Kinase Domain of Bruton Tyrosine Kinase. Human Mutation, 36: 638-647. doi:10.1002/humu.22791.  PUBMED  

DATASET 5

Dataset used for Kinact

384 amino acid substitutions in protein kinases in F1, 258 of which were mapped to experimentally solved 3D structures in F2.

        F1,      F2

Reference: Rodrigues, C. H., Ascher, D. B., & Pires, D. E. (2018). Kinact: a computational approach for predicting activating missense mutations in protein kinases. Nucleic acids research, 46(W1), W127-W132.  PUBMED  

DATASET 6

Dataset used for KinMutBase

KinMutBase is a comprehensive knowledge base for human disease-related variations in protein kinase domains. The latest version contains 1414 variations.

    https://structure-next.med.lu.se/idbase/KinMutBase/

Reference: Ortutay, C. , Valiaho, J. , Stenberg, K. and Vihinen, M. (2005), KinMutBase: A registry of disease-causing mutations in protein kinase domains. Hum. Mutat., 25: 435-442. doi:10.1002/humu.20166.  PUBMED

DATASET 7

Dataset used for Kin-Driver

Somatic variations in protein kinases with experimental evidence demonstrating their functional role. Database v82 contains 783 variations.

        F1,      F2

Reference: Simonetti, F. L., Tornador, C., Nabau-Moret, N., Molina-Vila, M. A., & Marino-Buslje, C. (2014). Kin-Driver: a database of driver mutations in protein kinases. Database : the journal of biological databases and curation, 2014, bau104. doi:10.1093/database/bau104.  PUBMED  

DATASET 8

Nonsynonymous coding SNVs in protein kinases. F1 contains 1463 disease-causing variants, F2 contains 999 unknown disease-causing (uDC) variants, and F3 contains 302 benign variants from Swiss-Prot.

        F1,      F2,      F3

References: A Torkamani, N J. Schork; Accurate prediction of deleterious protein kinase polymorphisms, Bioinformatics, Volume 23, Issue 21, 1 November 2007, Pages 2918-2925, https://doi.org/10.1093/bioinformatics/btm437.  PUBMED  
A Torkamani, N J. Schork, (2007) Distribution analysis of nonsynonymous polymorphisms within the human kinase gene family. Genomics, Volume 90, Issue 1, 2007, Pages 49-58, ISSN 0888-7543, https://doi.org/10.1016/j.ygeno.2007.03.006.  PUBMED  

DATASET 9

Dataset for wKinMut

865 disease-causing and 2627 neutral non-synonymous variants in human protein kinases.

        F1,      F2

Reference: Izarzugaza, J. M., Vazquez, M., del Pozo, A., & Valencia, A. (2013). wKinMut: an integrated tool for the analysis and interpretation of mutations in human protein kinases. BMC bioinformatics, 14, 345. doi:10.1186/1471-2105-14-345.  PUBMED  

DATASET 10

Dataset used for PTENpred

676 nonsynonymous SNVs in the tumor suppressor PTEN.

        F1

Reference: Johnston, S. B., & Raines, R. T. (2016). PTENpred: A Designer Protein Impact Predictor for PTEN-related Disorders. Journal of computational biology : a journal of computational molecular cell biology, 23(12), 969-975.  PUBMED  

DATASET 11

Protein-specific and general pathogenicity predictors for amino acid substitutions

Pathogenic and neutral variants for 82 proteins used to compare generic and protein specific predictors.

        Riera_dataset.zip 1872222 in 82 files

Reference: Riera C, Padilla N and de la Cruz X, 2016. The Complementarity Between Protein-Specific and General Pathogenicity Predictors for Amino Acid Substitutions. Hum Mutat 37:1013–1024  PUBMED  

DATASET 12

166 damaging and 21 benign amino acid substitutions in the neurodegenerative disorder Niemann-Pick disease type C (NP-C).
        F1

Reference: Adebali, O., Reznik, A. O., Ory, D. S., & Zhulin, I. B. (2016). Establishing the precise evolutionary history of a gene improves prediction of disease-causing missense mutations. Genetics in medicine : official journal of the American College of Medical Genetics, 18(10), 1029-36.  PUBMED  

DATASET 13

Dataset used for DPYDVarifier

Deleterious variants in dihydropyrimidine dehydrogenase (DPD, DPYD gene). F1 contains 69 variants with 30% or greater reduction in activity compared to wild type DPD. F2 contains 295 germline variants reported in dbSNP.

        F1,      F2

References: Hamzic S, Amstutz U, Largiader C, Come a long way, still a ways to go: from predicting and preventing fluoropyrimidine toxicity to increased efficacy?, Pharmacogenomics, 10.2217/pgs-2018-0040, 19, 8, (689-692), (2018).  PUBMED  
Shrestha S, Zhang C, Jerde C, Nie Q, Li H, Offer S, Diasio R (2018). Gene-Specific Variant Classifier (DPYD-Varifier) to Identify Deleterious Alleles of Dihydropyrimidine Dehydrogenase, CLINICAL PHARMACOLOGY & THERAPEUTICS, 104(4), 709-718.   PUBMED  

DATASET 14

Database of BRCA1/2 missense variants

Both files derive from a cohort of 523 index patients from families with HBOC. F1 contains 201 sequence alterations in BRCA1 or BRCA2. F2 contains 68 missense variants in BRCA1 or BRCA2.
        F1,      F2

Reference: Sadowski C, Kohlstedt D, Meisel C, Keller K, Becker K, Mackenroth L, Rump A, Schröck E, Wimberger P, Kast K, BRCA1/2 missense mutations and the value of in-silico analyses, European Journal of Medical Genetics, Volume 60, Issue 11, 2017, Pages 572-577, ISSN 1769-7212, https://doi.org/10.1016/j.ejmg.2017.08.005.  PUBMED

DATASET 15

F1 contains 20 cystic fibrosis transmembrane conductance regulator (CFTR) nucleotide-binding domain (NBD) variants. F2 contains 11 newly characterized NBD variants.
        F1,      F2

Reference: Masica, D. L., Sosnay, P. R., Raraigh, K. S., Cutting, G. R., & Karchin, R. (2014). Missense variants in CFTR nucleotide-binding domains predict quantitative phenotypes associated with cystic fibrosis disease severity. Human molecular genetics, 24(7), 1908-17.  PUBMED

DATASET 16

Dataset for HApredictor

1138 factor VIII amino acid substitutions from hemophilia A (HA) patients.

        F1

Reference: Hamasaki-Katagiri, N., Salari, R., Wu, A., Qi, Y., Schiller, T., Filiberto, A. C., Schisterman, E. F., Komar, A. A., Przytycka, T. M., Kimchi-Sarfaty, C. (2013). A gene-specific method for predicting hemophilia-causing point mutations. Journal of molecular biology, 425(21), 4023-33.  PUBMED  

DATASET 17

Dataset for MutaCYP

Cytochrome P450 monooxygenase (CYP) variation datasets. F1 is a control set (CS30), F2 is a training dataset of 285 variants in 15 CYPs, and F3 is a blind dataset of 328 variants whose association with disease is not entirely clear.

        F1,      F2,      F3

Reference: Fechter, K., & Porollo, A. (2014). MutaCYP: Classification of missense mutations in human cytochromes P450. BMC medical genomics, 7, 47. doi:10.1186/1755-8794-7-47.   PUBMED  

DATASET 18

Disease-causing non-synonymous single nucleotide variants in voltage-gated potassium (Kv) channels. F1 contains 1259 variants in the training dataset and F2 contains 176 variants in the test dataset.

        F1,      F2

Reference: L. F. Stead, I. C. Wood, D. R. Westhead (2011) KvSNP: accurately predicting the effect of genetic variants in voltage-gated potassium channels, Bioinformatics, Volume 27, Issue 16, 15 August 2011, Pages 2181-2186, https://doi.org/10.1093/bioinformatics/btr365.  PUBMED  

DATASET 19

Dataset for CFTR-MetaPred

Cystic fibrosis transmembrane conductance regulator (CFTR) variants. F1 contains 1899 variants of clinical significance and F2 contains a subset of 1210 amino acid substitutions.

        F1,      F2

Reference: Rychkova A, Buu M, Scharfe C, Lefterova M, Odegaard J, Schrijver I, Milla C, Bustamante C, Developing Gene-Specific Meta-Predictor of Variant Pathogenicity, doi: https://doi.org/10.1101/115956   PUBMED  

DATASET 20

Dataset for CYSMA, CFTR amino acid substitution predictor

Dataset of 128 disease-causing and 13 non-disease-causing variants

        F

Reference: Sasorith S, Baux D, Bergougnoux A, Paulet D, Lahure A, Bareil C, Taulan-Cadars M, Roux A, Koenig M, Claustres M, Raynal C, The CYSMA web server: An example of integrative tool for in silico analysis of missense variants identified in Mendelian disorders, Hum Mutat;41(2):375-386. doi: 10.1002/humu.23941.  PUBMED

DATASET 21

Dataset for KinMutRF

Disease-related variants in the protein kinase family.

        F

Reference: Pons T, Vazquez M, Matey-Hernandez M, Brunak S, Valencia A, Izarzugaza J, KinMutRF: a random forest classifier of sequence variants in the human protein kinase superfamily, BMC Genomics;17 Suppl 2(Suppl 2):396. doi: 10.1186/s12864-016-2723-1.  PUBMED  

DATASET 22

Cardiac sodium channel variants

1392 variants: 370 pathogenic, 602 benign, and 420 UVs.

        F

Reference: Tarnovskaya S, Korkosh V, Zhorov B, Frishman D, Predicting novel disease mutations in the cardiac sodium channel, Biochem Biophys Res Commun;521(3):603-611. doi: 10.1016/j.bbrc.2019.10.142  PUBMED  

DATASET 23

Dataset for SCN9A variants

31 pathogenic and 54 neutral variants

        F

Reference: Toffano A, Chiarot G, Zamuner S, Marchi M, Salvi E, Waxman S, Faber C, Lauria G, Giacometti A, Simeoni M, Computational pipeline to probe NaV1.7 gain-of-function variants in neuropathic painful syndromes, Sci Rep;10(1):17930. doi: 10.1038/s41598-020-74591-y.  PUBMED  

DATASET 24

Dataset for troponin variants

136 pathogenic or likely pathogenic amino acid substitutions in Tn genes: 13 in cardiac TnC (TNNC1), 65 in cardiac TnT (TNNT2) and 58 in cardiac TnI (TNNI3)

        F

Reference: Shakur R, Ochoa J, Robinson A, Niroula A, Chandran A, Rahman T, Vihinen M, Monserrat L, Prognostic implications of troponin T variations in inherited cardiomyopathies using systems biology, NPJ Genom Med;6(1):47. doi: 10.1038/s41525-021-00204-w.  PUBMED  

DATASET 25

Dataset for IDUA

        F

Reference: Borges P, Pasqualim G, Matte U, Which Is the Best In Silico Program for the Missense Variations in IDUA Gene? A Comparison of 33 Programs Plus a Conservation Score and Evaluation of 586 Missense Variants, Front Mol Biosci. 2021 Oct 21;8:752797. doi: 10.3389/fmolb.2021.752797.  PUBMED  


Last updated: 2021-02-24 by Niloofar Shirvanizadeh.