VariBench

DNA regulatory elements

DATASET 1

Mendelian regulatory variations including 42 enhancer, 142 promoter, 153 5' UTR, 43 3' UTR, 65 RNA gene, 3 imprinting control region, and 5 microRNA gene variations.

Reference: Smedley, D., Schubach, M., Jacobsen, J., Köhler, S., Zemojtel, T., Spielmann, M., Jäger, M., Hochheiser, H., Washington, N. L., McMurry, J. A., Haendel, M. A., Mungall, C. J., Lewis, S. E., Groza, T., Valentini, G., … Robinson, P. N. (2016). A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. American Journal of Human Genetics, 99(3), 595-606. PUBMED
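
As a quick sanity check on the Dataset 1 description, the per-category counts can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary and variable names are assumptions introduced here, and the counts are copied from the description above, not read from the dataset files.

```python
# Per-category counts copied from the Dataset 1 description (not parsed from files).
dataset1_counts = {
    "enhancer": 42,
    "promoter": 142,
    "5' UTR": 153,
    "3' UTR": 43,
    "RNA gene": 65,
    "imprinting control region": 3,
    "microRNA gene": 5,
}

total = sum(dataset1_counts.values())

# Share of each category, largest first.
shares = sorted(
    ((name, count / total) for name, count in dataset1_counts.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(f"total variations: {total}")
for name, share in shares:
    print(f"{name:>26}: {share:6.1%}")
```

The listed categories sum to 453 variations, with 5' UTR and promoter variations making up the two largest groups.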

DATASET 2

27558 Mendelian disease regulatory variants from OMIM and ClinVar, 20963 complex disease regulatory variants from VarDi and NHGRI GWAS Catalog and 43364 recurrent cancer somatic variations.

Reference: Ma, M., Ru, Y., Chuang, L. S., Hsu, N. Y., Shi, L. S., Hakenberg, J., Cheng, W. Y., Uzilov, A., Ding, W., Glicksberg, B. S., Chen, R. (2015). Disease-associated variants in different categories of disease located in distinct regulatory elements. BMC Genomics, 16 Suppl 8(Suppl 8), S3. PUBMED

DATASET 3

225 functional regulatory SNVs in monogenic and complex diseases, and 241910 SNVs from dbSNP as a negative control dataset.

Reference: Zhao, Y., Clark, W. T., Mort, M., Cooper, D. N., Radivojac, P., & Mooney, S. D. (2011). Prediction of functional regulatory SNPs in monogenic and complex disease. Human Mutation, 32(10), 1183-90. PUBMED

DATASET 4

Dataset used for CAPE

7948 control SNVs used for training and testing, 4044 control SNVs from HepG2, 2693 dsQTL SNVs, 51 deSNVs and 156 enhancer SNVs in B-cells from the NHGRI GWAS Catalog, 56497 GM12878 enhancer SNVs, and 2029 variants in the training set of the dsQTL model within hotspot DHS regions of non-blood cells.

Reference: Li, S., Alvarez, R. V., Sharan, R., Landsman, D., & Ovcharenko, I. (2016). Quantifying deleterious effects of regulatory variants. Nucleic Acids Research, 45(5), 2307-2317. PUBMED

DATASET 5

Dataset used for CDTS

F1 contains 15741 pathogenic noncoding variants that were >10 bp from any splice site (n = 1,369). F2 contains 427 noncoding variants associated with Mendelian traits.

F3 contains 67144812 SNVs used for context-dependent tolerance scores (CDTSs) computed with the subset of unrelated individuals (n = 7,794). F4 contains 34687974 SNVs used for CDTSs computed with the subset of unrelated individuals (n = 1,763) from the admixed population group (ADMIX).

F5 contains 30634572 SNVs used for CDTSs computed with the subset of unrelated individuals (n = 1,087) from the African population group (AFR). F6 contains 31893124 SNVs used for CDTSs computed with the subset of unrelated individuals (n = 4,436) from the European population group (EUR). F7 contains 61372584 SNVs used for CDTSs computed with all individuals (n = 11,257) from all population groups.

Reference: di Iulio, J., Bartha, I., Wong, E. H. M., Yu, H. C., Lavrenko, V., Yang, D., Jung, I., Hicks, M. A., Shah, N., Kirkness, E. F., Fabani, M. M., Biggs, W. H., Ren, B., Venter, J. C., & Telenti, A. (2018). The human noncoding genome defined by genetic diversity. Nature Genetics, 50, 333–337. PUBMED

DATASET 6

Dataset used for ShapeGTB

4462 functional sequence variations in regulatory DNA regions in training dataset F1 and 1116 functional SNVs in test dataset F2.

Reference: Malkowska, M., Zubek, J., Plewczynski, D., & Wyrwicz, L. S. (2018). ShapeGTB: the role of local DNA shape in prioritization of functional variants in human promoters with machine learning. PeerJ, 6, e5742. doi:10.7717/peerj.5742. PUBMED

DATASET 7

Dataset used for NCBoost

This dataset contains pathogenic non-coding variants in Mendelian diseases. F1 contains 655 high-confidence pathogenic non-coding variants associated with monogenic Mendelian disease genes. F2 contains 6550 variants randomly sampled from the set of common human SNVs without clinical assertion associated with the protein-coding genes in F1. F3 is a validation set of 770 variants: 70 newly reported SNVs in non-coding regions of protein-coding genes (the ‘positive’ set) and 700 randomly sampled common human variants, matched per type of region to the ‘positive’ set.

Reference: Caron, B., Luo, Y., & Rausell, A. (2019). NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans. Genome Biology, 20, 32. https://doi.org/10.1186/s13059-019-1634-2. PUBMED
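
The counts given for Dataset 7 imply a fixed ratio of negative to positive variants in both the training files (F1 and F2) and the validation file (F3). The sketch below checks that arithmetic. The dictionary keys and the helper `negatives_per_positive` are hypothetical names introduced for illustration, and the numbers are copied from the description, not computed from the files themselves.

```python
from fractions import Fraction

# Counts copied from the Dataset 7 description; names are illustrative.
ncboost_sets = {
    "training (F1 vs F2)": {"positive": 655, "negative": 6550},
    "validation (F3)": {"positive": 70, "negative": 700},
}


def negatives_per_positive(counts):
    """Return the exact negative:positive ratio as a Fraction."""
    return Fraction(counts["negative"], counts["positive"])


ratios = {name: negatives_per_positive(c) for name, c in ncboost_sets.items()}
f3_total = sum(ncboost_sets["validation (F3)"].values())

for name, ratio in ratios.items():
    print(f"{name}: {ratio} negatives per positive")
print(f"F3 total: {f3_total}")
```

Both sets come out at ten negatives per positive, and the two parts of F3 add up to the stated 770 variants.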

DATASET 8

Dataset for deltaSVM

Experimental validation results of randomly selected deltaSVM predictions from Tyr and Tyrp1 enhancers.

Reference: Lee, D., Gorkin, D. U., Baker, M., Strober, B. J., Asoni, A. L., McCallion, A. S., & Beer, M. A. (2015). A method to predict the impact of regulatory variants from DNA sequence. Nature Genetics, 47(8), 955-961. doi: 10.1038/ng.3331. PUBMED

DATASET 9

Dataset for ncVarDB

The database consists of 721 non-coding variants linked to published literature describing evidence of their functional consequences, along with 7228 covariate-matched benign controls from the single nucleotide polymorphism database (dbSNP151) that have a population frequency of over 5%.

Reference: Biggs, H., Parthasarathy, P., Gavryushkina, A., & Gardner, P. P. (2020). ncVarDB: a manually curated database for pathogenic non-coding variants and benign controls. Database (Oxford), 2020, baaa105. doi: 10.1093/database/baaa105. PUBMED

DATASET 10

Dataset for regBase

Three training datasets (1. regBase_REG and regBase_REG_Common; 2. regBase_PAT; 3. regBase_CAN) and eight independent testing datasets.

Reference: Zhang, S., He, Y., Liu, H., Zhai, H., Huang, H., et al. (2019). regBase: whole genome base-wise aggregation and functional prediction for human non-coding regulatory variants. Nucleic Acids Research, 47(21), e134. doi: 10.1093/nar/gkz774. PUBMED

DATASET 11

Dataset for WEVar

The training set consists of a total of 2873 SNVs, with 345 in the positive set and 2528 in the negative set.

Reference: Wang, Y., Jiang, Y., Yao, B., et al. (2021). WEVar: a novel statistical learning framework for predicting noncoding regulatory variants. Briefings in Bioinformatics, 22(6), bbab189. doi: 10.1093/bib/bbab189. PUBMED
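
The Dataset 11 training set is noticeably imbalanced, which matters for anyone reusing it to train a classifier. The following sketch computes the positive fraction and a set of "balanced" class weights. The weighting formula is a generic heuristic assumed here for illustration; it is not taken from the WEVar paper, and all variable names are hypothetical.

```python
# Counts copied from the Dataset 11 description.
wevar_train = {"positive": 345, "negative": 2528}

n_samples = sum(wevar_train.values())
n_classes = len(wevar_train)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * n_c).
# This is a generic reweighting scheme, assumed here for illustration;
# it is not taken from the WEVar paper.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in wevar_train.items()
}

positive_fraction = wevar_train["positive"] / n_samples

print(f"training SNVs: {n_samples}")
print(f"positive fraction: {positive_fraction:.3f}")
for label, weight in class_weights.items():
    print(f"weight[{label}] = {weight:.3f}")
```

With these weights, the positive and negative sets contribute the same total weight, so neither class dominates a weighted loss simply by being larger.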

A benchmark database for variations