DATASET 1
This dataset consists of somatic variations that lead to amino acid substitutions and to loss of protein activity in tumours. It is a subset of the variation dataset available in the Curated TP53 database.
This dataset consists of somatic variations leading to amino acid substitutions that are annotated as pathogenic in the ClinVar database.
DoCM validated cancer variations
This dataset consists of validated cancer variations leading to amino acid substitutions. It is a subset of the dataset available in the Database of Curated Mutations (DoCM).
Reference: Niroula A, Vihinen M (2015) Harmful somatic amino acid substitutions affect key pathways in cancers. BMC Med Genomics 8:53. PUBMED
DATASET 2
3706 amino acid substitutions in six oncogenes (BRAF, KIT, PIK3CA, KRAS, EGFR, ERBB2), six recently described cancer genes (ESR1, DICER1, MYOD1, IDH1, IDH2, SF3B1) and three tumor-suppressor genes (TSGs) (TP53, BRCA1, BRCA2).
References: Martelotto, L. G., Ng, C. K., De Filippo, M. R., Zhang, Y., Piscuoglio, S., Lim, R. S., Shen, R., Norton, L., Reis-Filho, J. S., Weigelt, B. (2014). Benchmarking mutation effect prediction algorithms using functionally validated cancer-related missense mutations. Genome biology, 15(10), 484. doi:10.1186/s13059-014-0484-1. PUBMED
DATASET 3
Dataset used for MutaGene
5276 variants in 58 genes: 4137 neutral and 1139 non-neutral variants.
References: Goncearenco, A., Rager, S.L., Li, M., Sang, Q.X., Rogozin, I.B., Panchenko, A.R. (2017) Exploring background mutational processes to decipher cancer genetic heterogeneity. Nucleic Acids Res. 2017 Jul 3; 45(Web Server issue): W514-W522. Published online 2017 May 4. doi: 10.1093/nar/gkx367. PUBMED
DATASET 4
Dataset for TP53_PROF
F1 is the negative set: protein variants that were never found (693 variants) or found only once (323 variants) in human cancer (no_cancer p53 variants).
F2 is the final curated dataset of 1294 variants. The negative set contains 1016 variants (1011 after removing variants kept for experimental validation) and the positive set contains 290 variants (283 after removing variants kept for experimental validation).
References: Ben-Cohen, G., Doffe, F., Devir, M., Leroy, B., Soussi, T., Rosenberg, S., TP53_PROF: a machine learning model to predict impact of missense mutations in TP53, Brief Bioinform. 2022 Mar 10;23(2):bbab524. doi: 10.1093/bib/bbab524. PUBMED
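The counts quoted for the TP53_PROF files can be cross-checked with a few lines of Python. This is only a sanity check on the numbers stated above; the variable and function names are illustrative and are not part of the dataset files. The arithmetic suggests that the 1294 total refers to the sets after the variants kept for experimental validation were removed.

```python
# Counts copied from the TP53_PROF dataset description.
never_found = 693             # variants never found in human cancer
found_once = 323              # variants found only once
negative_total = 1016         # stated size of the negative set
negative_after_holdout = 1011
positive_total = 290          # stated size of the positive set
positive_after_holdout = 283
curated_total = 1294          # stated size of the final curated dataset (F2)


def summarize():
    """Return simple consistency checks on the stated counts."""
    held_out_negative = negative_total - negative_after_holdout
    held_out_positive = positive_total - positive_after_holdout
    return {
        "negative_matches": never_found + found_once == negative_total,
        "curated_matches": negative_after_holdout + positive_after_holdout == curated_total,
        "held_out": held_out_negative + held_out_positive,
    }


if __name__ == "__main__":
    print(summarize())
```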
DATASET 5
Dataset used for dbCPM
The database of Cancer Passenger Mutations (dbCPM) collects curated passenger variations that are unlikely to engage in cancer development, progression, or therapy. dbCPM currently contains 941 experimentally supported variants.
Reference: Yue, Z., Zhao, L., Xia, J., dbCPM: a manually curated database for exploring the cancer passenger mutations, Briefings in Bioinformatics, 2018, bby105, https://doi.org/10.1093/bib/bby105. PUBMED
DATASET 6
Dataset used for OncoKB
OncoKB currently contains 4472 alterations for 595 genes of 38 tumor types.
Reference: Chakravarty, D., Gao, J., Phillips, S. M., Kundra, R., Zhang, H., Wang, J., Schultz, N. (2017). OncoKB: A Precision Oncology Knowledge Base. JCO precision oncology, 2017, 10.1200/PO.17.00011. doi:10.1200/PO.17.00011. PUBMED
DATASET 7
Dataset used for DoCM
DoCM currently contains 1364 curated somatic variants in cancer.
Reference: Ainscough, B. J., Griffith, M., Coffman, A. C., Wagner, A. H., Kunisaki, J., Choudhary, M. N., Mardis, E. R. (2016). DoCM: a database of curated mutations in cancer. Nature methods, 13(10), 806-807. doi:10.1038/nmeth.4000. PUBMED
DATASET 8
Dataset for driver insertions and deletions in dbCID
210 experimentally supported and 728 putative driver variants
References: Yue, Z., Zhao, L., Cheng, N., Yan, H., Xia, J., dbCID: a manually curated resource for exploring the driver indels in human cancer, Brief Bioinform;20(5):1925-1933. doi: 10.1093/bib/bby059. PUBMED
DATASET 9
Dataset for MutaGene, driver variant prediction
5276 variants in 58 genes: 4137 neutral and 1139 non-neutral variants.
References: Goncearenco, A., Rager, S.L., Li, M., Sang, Q.X., Rogozin, I.B., Panchenko, A.R., Exploring background mutational processes to decipher cancer genetic heterogeneity, Nucleic Acids Res;45(W1):W514-W522. doi: 10.1093/nar/gkx367. PUBMED
DATASET 10
164 cancer-related amino acid substitutions in 11 proteins.
Reference: Petrosino, M., Novak, L., Pasquo, A., Chiaraluce, R., Turina, P., Capriotti, E., and Consalvi, V. (2021). Analysis and Interpretation of the Impact of Missense Variants in Cancer. International journal of molecular sciences 22. PUBMED
DATASET 1
Dataset used for LQTS classification
312 amino acid substitution-causing variants in three long QT syndrome (LQTS) genes. SCN5A variants are in F10, LQT variants in the transmembrane domain (N/TM/C) in F11, and LQT variants in loop regions in F12. Variants for KCNQ1 (all in F13, pathogenic in F14, benign in F15), KCNH2 (all in F1, pathogenic in F2, benign in F3) and SCN5A (pathogenic in F6, benign in F7) are given in separate files. The whole of SCN5A is in F5, the amino-/carboxyl-terminus and transmembrane domain (N/TM/C) of SCN5A in F8, and the loop regions of SCN5A in F9. SNVs in the SCN5A gene causing Brugada syndrome (BrS) are in F4. All variants are combined in F16.
References: Leong, I. U., Stuckey, A., Lai, D., Skinner, J. R., & Love, D. R. (2015). Assessment of the predictive accuracy of five in silico prediction tools, alone or in combination, and two metaservers to classify long QT syndrome gene mutations. BMC medical genetics, 16, 34. doi:10.1186/s12881-015-0176-z. PUBMED
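Because the sixteen LQTS files are described out of numerical order, a small lookup table may make the layout easier to navigate. The sketch below simply restates the file labels from the description as a Python dictionary. The names `LQTS_FILES` and `files_mentioning` are hypothetical, the short descriptions are paraphrases, and no file names or formats are assumed.

```python
# File labels F1-F16 as listed in the LQTS dataset description (paraphrased).
LQTS_FILES = {
    "F1": "KCNH2, all variants",
    "F2": "KCNH2, pathogenic",
    "F3": "KCNH2, benign",
    "F4": "SCN5A, SNVs causing Brugada syndrome (BrS)",
    "F5": "SCN5A, whole gene",
    "F6": "SCN5A, pathogenic",
    "F7": "SCN5A, benign",
    "F8": "SCN5A, N/TM/C (termini and transmembrane domain)",
    "F9": "SCN5A, loop regions",
    "F10": "SCN5A (as listed in the description)",
    "F11": "LQT, transmembrane domain (N/TM/C)",
    "F12": "LQT, loop regions",
    "F13": "KCNQ1, all variants",
    "F14": "KCNQ1, pathogenic",
    "F15": "KCNQ1, benign",
    "F16": "all variants combined",
}


def files_mentioning(term):
    """Return file labels whose description contains `term`, in numeric order."""
    hits = [label for label, text in LQTS_FILES.items()
            if term.lower() in text.lower()]
    # Sort by the number after the leading "F" so F10 follows F9.
    return sorted(hits, key=lambda label: int(label[1:]))


if __name__ == "__main__":
    for gene in ("KCNH2", "KCNQ1", "SCN5A"):
        print(gene, files_mentioning(gene))
```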
DATASET 2
Dataset used for PolyPhen-HCM
74 gold standard variants for hypertrophic cardiomyopathy (HCM). F2 contains all possible 78983 variants of six HCM genes, namely MYBPC3, MYH7, MYL2, TNNI3, TNNT2 and TPM1.
References: Jordan, D. M., Kiezun, A., Baxter, S. M., Agarwala, V., Green, R. C., Murray, M. F., Pugh, T., Lebo, M. S., Rehm, H. L., Funke, B. H., Sunyaev, S. R. (2011). Development and validation of a computational method for assessment of missense variants in hypertrophic cardiomyopathy. American journal of human genetics, 88(2), 183-92. PUBMED
DATASET 3
Dataset used for FASMIC
Consensus functional annotations for 1,049 unique variations (F1) and 95 wild-type genes (F2) using both Ba/F3 and MCF10A models. F3 contains 40 wild-type variations from an independent repeat experiment of four allelic series: BRAF, EGFR, PIK3CA, and ERBB2. Datasets of cancer variants of unknown significance (VUS) from Kim et al. (2016) and Berger et al. (2016), together with comparisons against OncoKB datasets, are included in F5, F6, F7 and F8. F4 contains variations for cell viability of Ba/F3 cell lines. 22 weak activating variations are in F9.
References: Kwok-Shing, P., Li, J., Jeong, K.J., Shao, S., Chen, H., Tsang, Y.H., Sengupta, S., Wang, Z., Bhavana, V.H., Tran, R., Soewito, S., Minussi, D.C., Moreno, D., Kong, K., Dogruluk, T., Lu, H., Gao, J., Tokheim, C., Zhou, D.C., Johnson, A.M., Zeng, J., Ka Man, C., Ju, Z., Wester, M., Yu, S., Li, Y., Vellano, C.P., Schultz, N., Karchin, R., Ding, L., Lu, Y., Cheung, L.W.T., Chen, K., Shaw, K.R., Meric-Bernstam, F., Scott, K.L., Yi, S., Sahni, N., Liang, H., Mills, G.B., (2018) Systematic Functional Annotation of Somatic Mutations in Cancer. Cancer Cell, Volume 33, Issue 3, 2018, Pages 450-462.e10, ISSN 1535-6108, https://doi.org/10.1016/j.ccell.2018.01.021. PUBMED
DATASET 4
Dataset for interaction networks in e-MutPath
59712 variants
References: Li, Y., Burgman, B., Khatri, I.S., Pentaparthi, S.R., Su, Z., McGrail, D.J., Li, G., Wu, E., Eckhardt, S.G., Sahni, N., Yi, S.S., e-MutPath: computational modeling reveals the functional landscape of genetic mutations rewiring interactome networks, Nucleic Acids Res;49(1):e2. doi: 10.1093/nar/gkaa1015. PUBMED
DATASET 5
Dataset for SCN9A variant predictor
31 pathogenic and 54 neutral variants
References: Toffano, A.A., Chiarot, G., Zamuner, S. et al, Computational pipeline to probe NaV1.7 gain-of-function variants in neuropathic painful syndromes, Sci Rep;10(1):17930. doi: 10.1038/s41598-020-74591-y. PUBMED
DATASET 6
Dataset for CardioBoost
Arrhythmia (F1 contains additional benign test data, F2 contains additional pathogenic test data, F3 contains raw holdout test data, F4 contains raw training data, F5 contains train data)
Cardiomyopathy (F1 contains additional benign test data, F2 contains pathogenic test data, F3 contains holdout test data, F4 contains raw training data)
Cardiomyopathy alternative (for each of dcm, hcm, mybpc3_hcm, myh7_dcm and myh7_hcm: F1 contains holdout test data, F2 contains raw train data, F3 contains train data)
References: Zhang, X., Walsh, R., Whiffin, N., Buchan, R. et al., Disease-specific variant pathogenicity prediction significantly improves variant interpretation in inherited cardiac conditions, Genet Med. 2021 Jan;23(1):69-79. doi: 10.1038/s41436-020-00972-3. PUBMED
DATASET 7
Dataset for steroid metabolism diseases
The in vitro functional characterization findings from the references listed and prediction outcomes from SIFT, PolyPhen2 and PON-P.
References: Chan, A., Performance of in silico analysis in predicting the effect of non-synonymous variants in inherited steroid metabolic diseases, Steroids. 2013 Jul;78(7):726-30. doi: 10.1016/j.steroids.2013.04.002. PUBMED
Last updated: 2021-02-23 by Niloofar Shirvanizadeh.