VariBench

1. Variation datasets affecting protein tolerance

DATASET 2

Dataset 2 is a subset of Dataset 1 from which the cancer cases have been removed. Like Dataset 1, it consists of a neutral dataset and a pathogenic dataset.

Dataset of neutral single nucleotide substitutions

This is the negative (neutral) dataset, comprising 17,393 human nonsynonymous coding SNVs extracted from dbSNP build 131 by retaining variations with population frequency >0.1 and chromosome count >=50. The variant position mappings for this dataset were extracted from dbSNP.
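The selection criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Variant` record, its field names, and the example identifiers are hypothetical and do not reflect the dbSNP schema or the pipeline actually used to build the dataset. The thresholds are the ones quoted in the description.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record layout; field names are assumptions for illustration.
    identifier: str
    population_frequency: float
    chromosome_count: int


def is_neutral_candidate(variant, min_frequency=0.1, min_chromosomes=50):
    """Return True if the variant passes both thresholds quoted above.

    Frequency must be strictly greater than min_frequency;
    chromosome count must be at least min_chromosomes.
    """
    return (
        variant.population_frequency > min_frequency
        and variant.chromosome_count >= min_chromosomes
    )


def filter_neutral(variants):
    """Keep only the variants that satisfy the neutral-dataset criteria."""
    return [v for v in variants if is_neutral_candidate(v)]


# Made-up example records, not real dbSNP entries.
variants = [
    Variant("example_1", 0.25, 120),  # passes both thresholds
    Variant("example_2", 0.05, 200),  # frequency too low
    Variant("example_3", 0.40, 30),   # too few chromosomes
    Variant("example_4", 0.10, 50),   # frequency not strictly above 0.1
]

kept = filter_neutral(variants)
print([v.identifier for v in kept])
```

Note that the frequency test is strict (`>`) while the chromosome count test is inclusive (`>=`), matching the wording of the description.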

Download: Neutral Dataset*

Download: Neutral Dataset annotated with VariO**

Dataset of pathogenic single nucleotide substitutions

This is the pathogenic (positive) dataset of 14,610 amino acid substitutions obtained by manual curation from the PhenCode database (downloaded in June 2009), IDbases, and 16 individual LSDBs. For this dataset, the variations are available together with variant position mappings to RefSeq protein (>=99% match), RefSeq mRNA, and RefSeq genomic sequences.

Download: Pathogenic_Dataset

Download: Pathogenic_Dataset annotated with VariO**

Reference: Thusberg J, Olatubosun A, Vihinen M. Performance of mutation pathogenicity prediction methods on missense variants. Hum Mutat. 2011;32(4):358-368.

* Last updated: 2013-01-02.

** Tab-delimited file, updated: 2013-11-12.
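Since the VariO-annotated files are tab-delimited, they can be read with a standard CSV parser. The following is a minimal sketch. The column names (`variant_id`, `vario_annotation`) and the sample rows are assumptions made up for illustration, so inspect the header of the downloaded file before relying on any particular column.

```python
import csv
import io
from collections import Counter


def read_tab_delimited(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


# In-memory stand-in for a downloaded file; the content is invented.
sample = io.StringIO(
    "variant_id\tvario_annotation\n"
    "v1\tterm_a\n"
    "v2\tterm_b\n"
    "v3\tterm_a\n"
)

rows = read_tab_delimited(sample)

# Count how often each (hypothetical) annotation value occurs.
counts = Counter(row["vario_annotation"] for row in rows)
print(len(rows), dict(counts))
```

For a real file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`, where `path` is the location of the downloaded file.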

A benchmark database for variations