VariBench

A benchmark database for variations




Disease-specific datasets

A. Cancer variation datasets

DATASET 1

TP53 cancer variations

This dataset consists of somatic variations that lead to amino acid substitutions and to loss of protein activity in tumours. It is a subset of the variation dataset available in the Curated TP53 database.

ClinVar cancer variations

This dataset consists of somatic variations leading to amino acid substitutions that are annotated as pathogenic in the ClinVar database.

DoCM validated cancer variations

This dataset consists of validated cancer variations leading to amino acid substitutions. It is a subset of the dataset available in the Database of Curated Mutations (DoCM).

Reference: Niroula A, Vihinen M (2015) Harmful somatic amino acid substitutions affect key pathways in cancers. BMC Med Genomics 8:53.  PUBMED  

DATASET 2

3706 amino acid substitutions in six oncogenes (BRAF, KIT, PIK3CA, KRAS, EGFR, ERBB2), six recently described cancer genes (ESR1, DICER1, MYOD1, IDH1, IDH2, SF3B1) and three tumor-suppressor genes (TSGs) (TP53, BRCA1, BRCA2).

    F

References: Martelotto, L. G., Ng, C. K., De Filippo, M. R., Zhang, Y., Piscuoglio, S., Lim, R. S., Shen, R., Norton, L., Reis-Filho, J. S., Weigelt, B. (2014). Benchmarking mutation effect prediction algorithms using functionally validated cancer-related missense mutations. Genome biology, 15(10), 484. doi:10.1186/s13059-014-0484-1.  PUBMED  

DATASET 3

Dataset used for MutaGene

5276 variants in 58 genes: 4137 neutral and 1139 non-neutral variants.

    F

References: Goncearenco, A., Rager, S.L., Li, M., Sang, Q.X., Rogozin, I.B., Panchenko, A.R. (2017) Exploring background mutational processes to decipher cancer genetic heterogeneity. Nucleic Acids Res. 2017 Jul 3; 45(Web Server issue): W514-W522. Published online 2017 May 4. doi: 10.1093/nar/gkx367.  PUBMED  
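The class counts quoted for the MutaGene benchmark can be checked, and the class balance worked out, with a few lines of Python. This is only an illustrative sketch: the variable names are assumptions, and the numbers are copied from the description above rather than read from the dataset file.

```python
# Counts as stated in the MutaGene dataset description (not read from the file).
total_variants = 5276
neutral = 4137
non_neutral = 1139

# The two classes should add up to the stated total.
assert neutral + non_neutral == total_variants

# Share of each class in the benchmark.
non_neutral_fraction = non_neutral / total_variants
neutral_fraction = neutral / total_variants

print(f"non-neutral fraction: {non_neutral_fraction:.3f}")
print(f"neutral fraction:     {neutral_fraction:.3f}")
```

Because roughly four out of five variants are neutral, a predictor that always answers "neutral" would already reach the neutral fraction as its accuracy, so balance-aware measures are worth reporting alongside plain accuracy when using this set.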

DATASET 4

Dataset for TP53_PROF

F1 is the negative set: p53 protein variants that were never found (693 variants) or found only once (323 variants) in human cancer (no_cancer p53 variants).

F2 is the final curated dataset of 1294 variants. The negative set contains 1016 variants (1011 after removing variants kept for experimental validation) and the positive set contains 290 variants (283 after removing variants kept for experimental validation).

    F1     F2

References: Ben-Cohen, G., Doffe, F., Devir, M., Leroy, B., Soussi, T., Rosenberg, S., TP53_PROF: a machine learning model to predict impact of missense mutations in TP53, Brief Bioinform. 2022 Mar 10;23(2):bbab524. doi: 10.1093/bib/bbab524.  PUBMED  
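The variant counts given for the TP53_PROF files can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the F1 and F2 descriptions; the variable names are illustrative assumptions and nothing is read from the files themselves.

```python
# Numbers as stated in the TP53_PROF dataset description.
never_found = 693        # negative variants never seen in human cancer
found_once = 323         # negative variants seen only once
negative_raw = 1016      # negative set before removing validation variants
positive_raw = 290       # positive set before removing validation variants
negative_final = 1011    # negative set after removal
positive_final = 283     # positive set after removal
final_total = 1294       # stated size of the final curated dataset (F2)

# The negative set is the union of the two "no_cancer" groups.
assert never_found + found_once == negative_raw

# The final curated dataset is the sum of both sets after removal.
assert negative_final + positive_final == final_total

# Variants set aside for experimental validation, per class and overall.
held_out_negative = negative_raw - negative_final
held_out_positive = positive_raw - positive_final
held_out_total = held_out_negative + held_out_positive
print(held_out_negative, held_out_positive, held_out_total)
```

The stated total of 1294 therefore refers to the sets after the validation variants were removed, not to the raw 1016 + 290 variants.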

DATASET 5

Dataset used for dbCPM

The database of Cancer Passenger Mutations (dbCPM) contains curated passenger variations that are unlikely to be involved in cancer development, progression, or therapy. dbCPM currently contains 941 experimentally supported variants.

    F1    F2    F3    

Reference: Yue, Z., Zhao, L., Xia, J., dbCPM: a manually curated database for exploring the cancer passenger mutations, Briefings in Bioinformatics, 2018, bby105, https://doi.org/10.1093/bib/bby105.  PUBMED  

DATASET 6

Dataset used for OncoKB

OncoKB currently contains 4472 alterations in 595 genes across 38 tumor types.

    https://oncokb.org/

Reference: Chakravarty, D., Gao, J., Phillips, S. M., Kundra, R., Zhang, H., Wang, J., Schultz, N. (2017). OncoKB: A Precision Oncology Knowledge Base. JCO precision oncology, 2017, 10.1200/PO.17.00011. doi:10.1200/PO.17.00011.  PUBMED  

DATASET 7

Dataset used for DoCM

DoCM currently contains 1364 curated somatic variants in cancer.

    http://docm.info/

Reference: Ainscough, B. J., Griffith, M., Coffman, A. C., Wagner, A. H., Kunisaki, J., Choudhary, M. N., Mardis, E. R. (2016). DoCM: a database of curated mutations in cancer. Nature methods, 13(10), 806-807. doi:10.1038/nmeth.4000.  PUBMED  

DATASET 8

Dataset for driver insertions and deletions in dbCID

210 experimentally supported and 728 putative driver variants

        F1    F2    F3    

References: Yue, Z., Zhao, L., Cheng, N., Yan, H., Xia, J., dbCID: a manually curated resource for exploring the driver indels in human cancer, Brief Bioinform;20(5):1925-1933. doi: 10.1093/bib/bby059.  PUBMED  

DATASET 9

Dataset for MutaGene, driver variant prediction

5276 variants in 58 genes: 4137 neutral and 1139 non-neutral variants.

    F

References: Goncearenco, A., Rager, S.L., Li, M., Sang, Q.X., Rogozin, I.B., Panchenko, A.R., Exploring background mutational processes to decipher cancer genetic heterogeneity, Nucleic Acids Res;45(W1):W514-W522. doi: 10.1093/nar/gkx367.  PUBMED  

DATASET 10

Dataset for cancer variants

164 cancer-related amino acid substitutions in 11 proteins.

    F

Reference: Petrosino, M., Novak, L., Pasquo, A., Chiaraluce, R., Turina, P., Capriotti, E., and Consalvi, V. (2021). Analysis and Interpretation of the Impact of Missense Variants in Cancer. International journal of molecular sciences 22.  PUBMED  

B. Other diseases

DATASET 1

Dataset used for LQTS classification

312 amino acid substitution-causing variants in 3 long QT syndrome (LQT) genes. SCN5A is in F10, LQT in the transmembrane domain (N/TM/C) in F11, and LQT in loop regions in F12. Different combinations of the KCNQ1 (all in F13, pathogenic in F14 and benign in F15), KCNH2 (all in F1, pathogenic in F2 and benign in F3) and SCN5A (pathogenic in F6 and benign in F7) genes are given in other datasets. The whole of SCN5A only is in F5, only the amino-/carboxyl-terminus and transmembrane domain (N/TM/C) of SCN5A in F8, and only the loop regions of SCN5A in F9. SNVs in the SCN5A gene causing Brugada syndrome (BrS) are in F4. All variants are combined in F16.

    F1     F2     F3     F4     F5     F6     F7     F8     F9     F10     F11     F12     F13     F14     F15     F16

References: Leong, I. U., Stuckey, A., Lai, D., Skinner, J. R., & Love, D. R. (2015). Assessment of the predictive accuracy of five in silico prediction tools, alone or in combination, and two metaservers to classify long QT syndrome gene mutations. BMC medical genetics, 16, 34. doi:10.1186/s12881-015-0176-z.  PUBMED  

DATASET 2

Dataset used for PolyPhen-HCM

74 gold standard variants for hypertrophic cardiomyopathy (HCM). F2 contains all 78983 possible variants of six HCM genes: MYBPC3, MYH7, MYL2, TNNI3, TNNT2 and TPM1.

    F1     F2

References: Jordan, D. M., Kiezun, A., Baxter, S. M., Agarwala, V., Green, R. C., Murray, M. F., Pugh, T., Lebo, M. S., Rehm, H. L., Funke, B. H., Sunyaev, S. R. (2011). Development and validation of a computational method for assessment of missense variants in hypertrophic cardiomyopathy. American journal of human genetics, 88(2), 183-92.  PUBMED  

DATASET 3

Dataset used for FASMIC

Consensus functional annotations for 1,049 unique variations (F1) and 95 wild-type genes (F2) using both Ba/F3 and MCF10A models. F3 contains 40 wild-type variations from an independent repeat experiment of four allelic series: BRAF, EGFR, PIK3CA, and ERBB2. F4 contains variations for cell viability of Ba/F3 cell lines. Datasets of cancer variants of unknown significance (VUS) from Kim et al. (2016) and Berger et al. (2016), together with comparisons against OncoKB datasets, are included in F5, F6, F7 and F8. 22 weak activating variations are in F9.

    F1     F2     F3     F4     F5     F6     F7     F8     F9

References: Kwok-Shing, P., Li, J., Jeong, K.J., Shao, S., Chen, H., Tsang, Y.H., Sengupta, S., Wang, Z., Bhavana, V.H., Tran, R., Soewito, S., Minussi, D.C., Moreno, D., Kong, K., Dogruluk, T., Lu, H., Gao, J., Tokheim, C., Zhou, D.C., Johnson, A.M., Zeng, J., Ka Man, C., Ju, Z., Wester, M., Yu, S., Li, Y., Vellano, C.P., Schultz, N., Karchin, R., Ding, L., Lu, Y., Cheung, L.W.T., Chen, K., Shaw, K.R., Meric-Bernstam, F., Scott, K.L., Yi, S., Sahni, N., Liang, H., Mills, G.B. (2018) Systematic Functional Annotation of Somatic Mutations in Cancer. Cancer Cell, Volume 33, Issue 3, Pages 450-462.e10, ISSN 1535-6108, https://doi.org/10.1016/j.ccell.2018.01.021.  PUBMED  

DATASET 4

Dataset for interaction networks in e-MutPath

59712 variants

    F

References: Li, Y., Burgman, B., Khatri, I.S., Pentaparthi, S.R., Su, Z., McGrail, D.J., Li, G., Wu, E., Eckhardt, S.G., Sahni, N., Yi, S.S., e-MutPath: computational modeling reveals the functional landscape of genetic mutations rewiring interactome networks, Nucleic Acids Res;49(1):e2. doi: 10.1093/nar/gkaa1015.   PUBMED  

DATASET 5

Dataset for SCN9A variant predictor

31 pathogenic and 54 neutral variants

    F

References: Toffano, A.A., Chiarot, G., Zamuner, S. et al, Computational pipeline to probe NaV1.7 gain-of-function variants in neuropathic painful syndromes, Sci Rep;10(1):17930. doi: 10.1038/s41598-020-74591-y.  PUBMED  

DATASET 6

Dataset for CardioBoost

Arrhythmia (F1 contains additional benign test data, F2 contains additional pathogenic test data, F3 contains raw holdout test data, F4 contains raw training data, F5 contains train data)

Cardiomyopathy (F1 contains additional benign test data, F2 contains pathogenic test data, F3 contains holdout test data, F4 contains raw training data)

Cardiomyopathy alternative (for each of dcm, hcm, mybpc3_hcm, myh7_dcm and myh7_hcm: F1 contains holdout test data, F2 contains raw train data, F3 contains train data)

    Arrhythmia (test      F1     F2     F3     train     F4     F5)
    Cardiomyopathy (test     F1     F2     F3     train     F4)
    Cardiomyopathy alternative dcm (test     F1     train     F2)
    Cardiomyopathy alternative hcm (test     F1      train     F2     F3)
    Cardiomyopathy alternative mybpc3_hcm (test     F1      train     F2     F3)
    Cardiomyopathy alternative myh7_dcm (test     F1      train     F3     F4)
    Cardiomyopathy alternative myh7_hcm (test     F1      train     F2     F3)

References: Zhang, X., Walsh, R., Whiffin, N., Buchan, R. et al., Disease-specific variant pathogenicity prediction significantly improves variant interpretation in inherited cardiac conditions, Genet Med. 2021 Jan;23(1):69-79. doi: 10.1038/s41436-020-00972-3.  PUBMED  

DATASET 7

Dataset for steroid metabolism diseases

This dataset contains the in vitro functional characterization findings from the listed references, together with prediction outcomes from SIFT, PolyPhen2 and PON-P.

        F

References: Chan, A., Performance of in silico analysis in predicting the effect of non-synonymous variants in inherited steroid metabolic diseases, Steroids. 2013 Jul;78(7):726-30. doi: 10.1016/j.steroids.2013.04.002.  PUBMED  


Last updated: 2021-02-23 by Niloofar Shirvanizadeh.