VariBench

Test datasets

These datasets have been used to test the performance of prediction methods.

DATASET 1

Grimm circularity dataset

Filtered versions of five publicly available benchmark datasets for pathogenicity prediction. The sets were filtered/selected from HumVar, ExoVar, PredictSNP, VariBench and SwissVar.

Reference: Grimm D, Azencott C, Aicheler F, Gieraths U, MacArthur D, Samocha K, Cooper D, Stenson P, Daly M, Smoller J, Duncan L, Borgwardt K, 2015. Evaluation of tools used to predict the impact of missense variants is hindered by two types of circularity. Hum Mutat. doi:10.1002/humu.22768. PUBMED

DATASET 2

Dataset used for ACMG/AMP clinical variant interpretation

14,819 benign or pathogenic variants from the ClinVar database.
F1 contains 14,819 ClinVar one star variants (7346 benign and 7473 pathogenic).
F2 contains 1442 tumor suppressor gene (TSG) variants.
F3 contains 4667 variants in genes with both dominant and recessive modes of inheritance.
F4 contains 6931 variants from ClinVar Sept 2016 and F5 contains 5379 variants from ClinVar March 2017; both hold benign and pathogenic variants with one star or above.
F6 contains 12,496 ClinVar variants (6275 benign and 6221 pathogenic; review status of one star or more in ClinVar).
F7 contains 14,819 benign or pathogenic variants from ClinVar Sept-December 2016.
F8 is a balanced dataset of 4192 ClinVar variants (one star or above status) in which each gene has the same number of benign and pathogenic variants.
F9 contains 16064 variants in the predictSNPdsel benchmark dataset.
F10 contains 10308 variants in the highly unbalanced VariBenchselected dataset.
F11 contains 7766 variants (4473 benign and 3293 pathogenic) in the Exclude LP and LB set, which were asserted “pathogenic” and “benign” in the ClinVar September 2016 release.

Reference: Ghosh, R., Oak, N., & Plon, S. E. (2017). Evaluation of in silico algorithms for use with ACMG/AMP clinical variant interpretation guidelines. Genome biology, 18(1), 225. doi:10.1186/s13059-017-1353-5. PUBMED
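
The benign and pathogenic subtotals stated for F1, F6 and F11 of Dataset 2 can be checked against the file totals with a few lines of Python. This is only a minimal sketch: the names STATED_COUNTS, check_totals and pathogenic_fraction are hypothetical, and the numbers are copied from the description above, not read from the dataset files.

```python
# Counts copied from the Dataset 2 description (assumption: not read from the files).
# Each entry maps a file label to (stated total, benign, pathogenic).
STATED_COUNTS = {
    "F1": (14819, 7346, 7473),
    "F6": (12496, 6275, 6221),
    "F11": (7766, 4473, 3293),
}


def check_totals(counts):
    """Return the labels whose benign + pathogenic counts differ from the total."""
    return [
        label
        for label, (total, benign, pathogenic) in counts.items()
        if benign + pathogenic != total
    ]


def pathogenic_fraction(benign, pathogenic):
    """Fraction of variants labelled pathogenic."""
    return pathogenic / (benign + pathogenic)


mismatches = check_totals(STATED_COUNTS)
print("Files with inconsistent counts:", mismatches)
for label, (_, benign, pathogenic) in STATED_COUNTS.items():
    print(label, round(pathogenic_fraction(benign, pathogenic), 3))
```

An empty mismatch list means the stated subtotals add up to the stated totals. The pathogenic fraction shows how close each file is to a balanced class ratio.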

DATASET 3

Dataset used for performance evaluation of pathogenicity computation methods

11995 amino acid substitutions: variants from ClinVar related to genetic diseases, somatic variants from the IARC TP53 and ICGC databases related to human cancers, and experimentally evaluated PPARG variants.

Reference: Li J, Zhao T, Zhang Y, Zhang K, Shi L, Chen Y, Wang X, Sun Z; Performance evaluation of pathogenicity-computation methods for missense variants, Nucleic Acids Research, Volume 46, Issue 15, 6 September 2018, Pages 7793–7804, https://doi.org/10.1093/nar/gky678. PUBMED

DATASET 4

Dataset used for PRDIS

28474 pathogenic and 336730 neutral variants in 228 proteins.

F1,

Reference: de la Campa, E. Á., Padilla, N., & de la Cruz, X. (2017). Development of pathogenicity predictors specific for variants that do not comply with clinical guidelines for the use of computational evidence. BMC genomics, 18(Suppl 5), 569. doi:10.1186/s12864-017-3914-0. PUBMED
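
Dataset 4 contains roughly twelve neutral variants for every pathogenic one, so plain accuracy can look good for a predictor that never reports a pathogenic variant. The sketch below illustrates this with the counts stated above. The function names are hypothetical, and the "always neutral" predictor is an illustrative assumption, not a method from the cited paper.

```python
# Counts copied from the Dataset 4 description (assumption: not read from the file).
PATHOGENIC = 28474
NEUTRAL = 336730


def majority_baseline_accuracy(n_pos, n_neg):
    """Accuracy of a trivial predictor that always outputs the larger class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity; does not depend on the class ratio."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical predictor that calls every variant neutral.
baseline = majority_baseline_accuracy(PATHOGENIC, NEUTRAL)
balanced = balanced_accuracy(tp=0, fn=PATHOGENIC, tn=NEUTRAL, fp=0)
print("Accuracy of always-neutral predictor:", round(baseline, 3))
print("Balanced accuracy of the same predictor:", balanced)
```

The trivial predictor reaches about 92% accuracy but only 0.5 balanced accuracy, which is why ratio-insensitive measures are usually preferred when evaluating on a dataset this unbalanced.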

DATASET 5

Dataset used for the in silico assessment of pathogenicity for compensated variants

1964 compensated pathogenic deviations (CPDs) in 684 protein-coding genes.

Reference: Azevedo, L., Mort, M., Costa, A. C., Silva, R. M., Quelhas, D., Amorim, A., & Cooper, D. N. (2016). Improving the in silico assessment of pathogenicity for compensated variants. European journal of human genetics: EJHG, 25(1), 2-7. PUBMED

DATASET 6

VariBench representativeness datasets

VariBench tolerance datasets with mappings to CATH, Pfam, EC and GO used in a study of dataset representativeness.

Reference: Schaafsma G, Vihinen M. Representativeness of variation benchmark datasets. BMC Bioinformatics. 2018 Nov 29;19(1):461. doi:10.1186/s12859-018-2478-6. PUBMED

DATASET 7

Assessment with clinical dataset

The file contains 1757 variants.

Reference: Gunning, A.C., Fryer, V., Fasham, J., Crosby, A.H., Ellard, S., Baple, E.L., and Wright, C.F. (2021). Assessing performance of pathogenicity predictors using clinically relevant variant datasets. J Med Genet 58, 547-555. PUBMED

DATASET 8

Dataset for benchmarking 24 tools

F1 contains 35,167 pathogenic variants. Of these, 16,411 were nonsense variants and 18,756 were amino acid substitutions. F2 contains 29,173 benign variants.

Reference: Anderson D, Lassmann T. An expanded phenotype centric benchmark of variant prioritisation tools. Hum Mutat. 2022 May;43(5):539-546. doi: 10.1002/humu.24362. PUBMED

DATASET 9

Rett syndrome variants

The golden dataset contains 2123 pathogenic and 2231 benign variants.

Reference: Ganakammal, S.R., and Alexov, E. (2019). Evaluation of performance of leading algorithms for variant pathogenicity predictions and designing a combinatory predictor method: application to Rett syndrome variants. PeerJ 7, e8106. PUBMED

Last updated: 2022-06-28 by Niloofar Shirvanizadeh.

A benchmark database for variations