VariBench

1. Variation datasets affecting protein tolerance

DATASET 5

DATASET 5 is a subset of DATASET 2, obtained by clustering the protein sequences by sequence similarity to remove close homologues, which may cause problems in certain applications. Clustering was performed with the CD-HIT Suite (Huang et al., 2010) using a sequence identity cut-off of 30%. The sequence with the highest number of variations was chosen to represent each cluster. This dataset also comprises neutral and pathogenic datasets.
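The representative-selection step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the dataset. It assumes the cluster assignments have already been produced by a clustering tool such as CD-HIT and are available as a mapping from sequence identifier to cluster identifier. The function names, the identifiers and the tie-breaking rule (smallest identifier wins) are all hypothetical.

```python
from collections import Counter, defaultdict


def pick_representatives(cluster_of, variants):
    """Choose one representative sequence per cluster.

    cluster_of: dict mapping sequence id -> cluster id (assumed precomputed).
    variants:   iterable of (sequence id, substitution) pairs.
    Returns a dict mapping cluster id -> representative sequence id.
    """
    # Count how many variations are recorded for each sequence.
    counts = Counter(seq_id for seq_id, _ in variants)

    # Group sequence ids by their cluster.
    members = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        members[cluster_id].append(seq_id)

    # Pick the member with the most variations. Sorting first makes ties
    # resolve to the smallest id; this tie-break is an assumption.
    return {
        cluster_id: max(sorted(ids), key=lambda s: counts[s])
        for cluster_id, ids in members.items()
    }


def filter_to_representatives(variants, representatives):
    """Keep only the variations located in representative sequences."""
    keep = set(representatives.values())
    return [v for v in variants if v[0] in keep]


# Toy data with made-up identifiers, for illustration only.
cluster_of = {"P1": 0, "P2": 0, "P3": 1}
variants = [("P1", "A10V"), ("P2", "G5R"), ("P2", "L7P"), ("P3", "K3E")]

representatives = pick_representatives(cluster_of, variants)
clustered = filter_to_representatives(variants, representatives)
```

In the toy data, P1 and P2 share a cluster. P2 carries two variations and P1 only one, so P2 represents the cluster and the variation in P1 is dropped.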

Clustered pathogenic single nucleotide substitutions from Dataset 2

This dataset contains amino acid substitutions in 884 representative sequences (clusters).

Download: Clustered pathogenic dataset (Olatubosun et al., 2012)

Download: Clustered pathogenic dataset annotated with VariO**

Clustered neutral single nucleotide substitutions from Dataset 1

This is the negative (neutral) dataset, comprising human nonsynonymous coding SNVs in 5469 representative sequences (clusters) from dbSNP build 131.

Download: Clustered neutral dataset (Olatubosun et al., 2012)*

Download: Clustered neutral dataset annotated with VariO**

References:
Olatubosun A, Väliaho J, Härkönen J, Thusberg J, Vihinen M. PON-P: Integrated predictor for pathogenicity of missense variants. Hum Mutat. 2012 Apr 13. doi: 10.1002/humu.22102.
Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010. 26:680-682.

* Last updated: 2013-01-02.

** Tab-delimited file, updated: 2013-11-12.

A benchmark database for variations
