DATASET 1
Mendelian regulatory variations including 42 enhancer, 142 promoter, 153 5' UTR, 43 3' UTR, 65 RNA gene, 3 imprinting control region, and 5 microRNA gene variations.
Reference: Smedley, D., Schubach, M., Jacobsen, J., Köhler, S., Zemojtel, T., Spielmann, M., Jäger, M., Hochheiser, H., Washington, N. L., McMurry, J. A., Haendel, M. A., Mungall, C. J., Lewis, S. E., Groza, T., Valentini, G., … Robinson, P. N. (2016). A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. American Journal of Human Genetics, 99(3), 595-606. PUBMED
DATASET 2
27558 Mendelian disease regulatory variants from OMIM and ClinVar, 20963 complex disease regulatory variants from VarDi and NHGRI GWAS Catalog and 43364 recurrent cancer somatic variations.
Reference: Ma, M., Ru, Y., Chuang, L. S., Hsu, N. Y., Shi, L. S., Hakenberg, J., Cheng, W. Y., Uzilov, A., Ding, W., Glicksberg, B. S., Chen, R. (2015). Disease-associated variants in different categories of disease located in distinct regulatory elements. BMC Genomics, 16 Suppl 8(Suppl 8), S3. PUBMED
DATASET 3
225 functional regulatory SNVs in monogenic and complex diseases and 241910 SNVs from dbSNP as a negative control dataset.
Reference: Zhao, Y., Clark, W. T., Mort, M., Cooper, D. N., Radivojac, P., & Mooney, S. D. (2011). Prediction of functional regulatory SNPs in monogenic and complex disease. Human Mutation, 32(10), 1183-90. PUBMED
DATASET 4
7948 control SNVs used for training and testing, 4044 control SNVs from HepG2, 2693 dsQTL SNVs, 51 deSNVs and 156 enhancer SNVs in B-cells from the NHGRI GWAS Catalog, 56497 GM12878 enhancer SNVs, and 2029 variants in the training set of the dsQTL model within hotspot DHS regions of non-blood cells.
Reference: Li, S., Alvarez, R. V., Sharan, R., Landsman, D., & Ovcharenko, I. (2016). Quantifying deleterious effects of regulatory variants. Nucleic Acids Research, 45(5), 2307-2317. PUBMED
DATASET 5
F1 contains 15741 pathogenic noncoding variants that were >10 bp from any splice sites (n = 1369). F2 contains 427 noncoding variants associated with Mendelian traits.
F3 contains 67144812 SNVs used for context-dependent tolerance scores (CDTSs) computed with the subset of unrelated individuals (n = 7794). F4 contains 34687974 SNVs used for CDTSs computed with the subset of unrelated individuals (n = 1763) from the ADMIX (admixed) population group.
F5 contains 30634572 SNVs used for CDTSs computed with the subset of unrelated individuals (n = 1087) from the AFR (African) population group. F6 contains 31893124 SNVs used for CDTSs computed with the subset of unrelated individuals (n = 4436) from the EUR (European) population group. F7 contains 61372584 SNVs used for CDTSs computed with all individuals (n = 11257) from all population groups.
Reference: di Iulio, J., Bartha, I., Wong, E. H. M., Yu, H. C., Lavrenko, V., Yang, D., Jung, I., Hicks, M. A., Shah, N., Kirkness, E. F., Fabani, M. M., Biggs, W. H., Ren, B., Venter, J. C., & Telenti, A. (2018). The human noncoding genome defined by genetic diversity. Nature Genetics, 50, 333–337. PUBMED
DATASET 6
4462 functional sequence variations in regulatory DNA regions in training dataset F1 and 1116 functional SNVs in test dataset F2.
Reference: Malkowska, M., Zubek, J., Plewczynski, D., & Wyrwicz, L. S. (2018). ShapeGTB: the role of local DNA shape in prioritization of functional variants in human promoters with machine learning. PeerJ, 6, e5742. doi: 10.7717/peerj.5742. PUBMED
DATASET 7
This dataset contains pathogenic non-coding variants in Mendelian diseases. F1 contains 655 high-confidence pathogenic non-coding variants associated with monogenic Mendelian disease genes. F2 contains 6550 variants randomly sampled from the set of common human SNVs without clinical assertion associated with the protein-coding genes in F1. F3 is a validation set of 770 variants: a ‘positive’ set of 70 newly reported SNVs in non-coding regions of protein-coding genes, and 700 randomly sampled common human variants matched per type of region to the ‘positive’ set.
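The counts in this description imply the same positive-to-negative ratio in the training files and in the validation set. The following is a minimal sketch that checks this arithmetic, assuming F1 serves as the positive set and F2 as the negative set for training. The dictionary layout and the function name `class_balance` are illustrative assumptions and are not part of the dataset files.

```python
from fractions import Fraction

# Counts copied from the Dataset 7 description. The split names and the
# positive/negative labelling of F1/F2 are assumptions made for illustration.
DATASET_7 = {
    "training": {"positive": 655, "negative": 6550},   # F1 vs. F2
    "validation": {"positive": 70, "negative": 700},   # the two parts of F3
}


def class_balance(counts):
    """Return the total size and class ratios for one split."""
    total = counts["positive"] + counts["negative"]
    return {
        "total": total,
        # Exact positive-to-negative ratio, reduced to lowest terms.
        "pos_to_neg": Fraction(counts["positive"], counts["negative"]),
        # Share of positives among all variants in the split.
        "positive_fraction": counts["positive"] / total,
    }


if __name__ == "__main__":
    for split, counts in DATASET_7.items():
        summary = class_balance(counts)
        print(split, summary["total"], summary["pos_to_neg"],
              round(summary["positive_fraction"], 4))
```

Both splits reduce to a 1:10 ratio, so a classifier evaluated on them faces an imbalanced problem in which metrics such as precision-recall curves are typically more informative than raw accuracy.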
Reference: Caron, B., Luo, Y., & Rausell, A. (2019). NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans. Genome Biology, 20, 32. https://doi.org/10.1186/s13059-019-1634-2. PUBMED
DATASET 8
Experimental validation results of randomly selected deltaSVM predictions from Tyr and Tyrp1 enhancers.
Reference: Lee, D., Gorkin, D. U., Baker, M., Strober, B. J., Asoni, A. L., McCallion, A. S., & Beer, M. A. (2015). A method to predict the impact of regulatory variants from DNA sequence. Nature Genetics, 47(8), 955-61. doi: 10.1038/ng.3331. PUBMED
DATASET 9
The database consists of 721 non-coding variants linked to published literature describing the evidence of their functional consequences, together with 7228 covariate-matched benign controls from the Single Nucleotide Polymorphism Database (dbSNP151) that have a population frequency of over 5%.
Reference: Biggs, H., Parthasarathy, P., Gavryushkina, A., & Gardner, P. P. (2020). ncVarDB: a manually curated database for pathogenic non-coding variants and benign controls. Database (Oxford), 2020, baaa105. doi: 10.1093/database/baaa105. PUBMED
DATASET 10
Three training datasets (1. regBase_REG and regBase_REG_Common, 2. regBase_PAT, 3. regBase_CAN) and eight independent testing datasets.
Reference: Zhang, S., He, Y., Liu, H., Zhai, H., Huang, H., et al. (2019). regBase: whole genome base-wise aggregation and functional prediction for human non-coding regulatory variants. Nucleic Acids Research, 47(21), e134. doi: 10.1093/nar/gkz774. PUBMED
DATASET 11
The training set consists of a total of 2873 SNVs, with 345 in the positive set and 2528 in the negative set.
Reference: Wang, Y., Jiang, Y., Yao, B., et al. (2021). WEVar: a novel statistical learning framework for predicting noncoding regulatory variants. Briefings in Bioinformatics, 22(6), bbab189. doi: 10.1093/bib/bbab189. PUBMED
Last updated: 2021-02-07 by Niloofar Shirvanizadeh.