VariBench

1. Variation datasets affecting protein tolerance

DATASET 6

Dataset 6 is a subset of Dataset 3. It was extracted by clustering the protein sequences in Dataset 3 by sequence similarity with the CD-HIT suite (Huang et al., 2010) to remove close homologues, which may cause problems in certain applications. The sequence identity cut-off was set to 30%, and the sequence with the highest number of variations was chosen to represent each cluster. The dataset contains a total of 1592 variations in 272 human and non-human proteins.
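The representative-selection step described above can be illustrated with a minimal Python sketch. It is not the code used to build the dataset. The function name `pick_representatives`, the input layout (a mapping from cluster ID to member protein IDs, plus a mapping from protein ID to variation count), and the alphabetical tie-breaking rule are all assumptions made for illustration. The clustering itself is assumed to have been done beforehand, for example with CD-HIT.

```python
from typing import Dict, List


def pick_representatives(
    clusters: Dict[str, List[str]],
    variation_counts: Dict[str, int],
) -> Dict[str, str]:
    """Return one representative protein ID per cluster.

    The member with the highest number of variations is chosen.
    Ties are broken alphabetically by protein ID; this rule is an
    assumption, since the source does not say how ties were handled.
    """
    representatives = {}
    for cluster_id, members in clusters.items():
        # Sort key: most variations first, then protein ID for determinism.
        representatives[cluster_id] = min(
            members,
            key=lambda protein: (-variation_counts.get(protein, 0), protein),
        )
    return representatives


# Hypothetical toy input: two clusters and made-up variation counts.
clusters = {"c0": ["P1", "P2", "P3"], "c1": ["P4"]}
variation_counts = {"P1": 3, "P2": 12, "P3": 12, "P4": 1}

reps = pick_representatives(clusters, variation_counts)
print(reps)  # {'c0': 'P2', 'c1': 'P4'}
```

In this toy input, P2 and P3 tie at 12 variations, so the assumed tie-breaking rule selects P2. Only the variations belonging to the chosen representatives would then be kept in the clustered dataset.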

Download: Clustered PMD dataset with representatives

Reference: Olatubosun A, Väliaho J, Härkönen J, Thusberg J, Vihinen M. PON-P: Integrated predictor for pathogenicity of missense variants. Hum Mutat. 2012 Apr 13. doi: 10.1002/humu.22102.

A benchmark database for variations
