A benchmark database for variations




(a) Insertions and Deletions

DATASET 1

Dataset for DDIG-in

The positive (disease-causing) insertion and deletion dataset consists of 9007 variants: 2667 frameshift (FS) and 6340 non-frameshift (NFS) insertions and deletions. A >35% sequence identity threshold was used to avoid sequence redundancy.

    F1,      F2,     F3,     F4

The neutral dataset contains 8861 micro-insertions and deletions from the 1000 Genomes Project (20101123 release): 2587 FS and 6274 NFS insertions and deletions. There is no overlap between the positions of the microdeletions in these datasets.

    F5,      F6,     F7,     F8
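As a quick consistency check on the counts quoted above, the following minimal sketch confirms that the FS and NFS counts add up to the stated totals and reports the FS share of each subset. The dictionary layout and the function names are illustrative assumptions; they do not reflect the format of the downloadable files.

```python
# Counts as quoted in the DDIG-in dataset description.
# The dictionary layout is an assumption made for this sketch only.
DDIG_IN_COUNTS = {
    "positive": {"FS": 2667, "NFS": 6340, "total": 9007},
    "neutral": {"FS": 2587, "NFS": 6274, "total": 8861},
}


def subset_totals_agree(counts):
    """Return True if FS + NFS equals the stated total in every subset."""
    return all(c["FS"] + c["NFS"] == c["total"] for c in counts.values())


def fs_fraction(subset):
    """Return the fraction of frameshift variants in one subset."""
    return subset["FS"] / subset["total"]


if __name__ == "__main__":
    print("Totals consistent:", subset_totals_agree(DDIG_IN_COUNTS))
    for name, subset in DDIG_IN_COUNTS.items():
        print(f"{name}: FS fraction = {fs_fraction(subset):.3f}")
```

The same check can be repeated against the row counts of the downloaded files once their format is known.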

Reference: Folkman L, Yang Y, Li Z, Stantic B, Sattar A, Mort M, Cooper DN, Liu Y, Zhou Y. DDIG-in: detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels. Bioinformatics. 2015 May 15;31(10):1599-606. doi: 10.1093/bioinformatics/btu862. Epub 2015 Jan 7. PubMed PMID: 25573915.  PUBMED  

DATASET 2

Dataset for ENTPRISE-X

The pathogenic dataset contains 6513 frameshift (FS) and 5023 non-frameshift (NFS) insertions and deletions from the ClinVar database. 82 VEST-indel variants were included in the test dataset.

    F1,      F2,     F3

The training dataset from the 1000 Genomes Project phase 3 contains 366 FS and 3171 NFS insertions and deletions.

    F4,      F5

The neutral dataset from ESP6500 consists of 1604 FS and 181 NFS insertions and deletions.

    F6,      F7

The VEST-indel dataset consists of 1025 neutral variants.

    F8

Reference: Zhou H, Gao M, Skolnick J (2018) ENTPRISE-X: Predicting disease-associated frameshift and nonsense mutations. PLoS ONE 13(5): e0196849. https://doi.org/10.1371/journal.pone.0196849  PUBMED  

DATASET 3

Dataset for KD4i

The NFS-indel dataset contains 2734 variants from the UniProtKB/Swiss-Prot database, mapped to 1535 distinct proteins.

    F1

Reference: Bermejo-Das-Neves C, Nguyen HN, Poch O, Thompson JD. A comprehensive study of small non-frameshift insertions/deletions in proteins and prediction of their phenotypic effects by a machine learning method (KD4i). BMC Bioinformatics. 2014 Apr 17;15:111. doi: 10.1186/1471-2105-15-111. PubMed PMID: 24742296; PubMed Central PMCID: PMC4021375.  PUBMED  

DATASET 4

Dataset for SIFT-Indel

Insertions and deletions whose length is divisible by 3 cause amino acid insertions/deletions or block substitutions and are called 3n changes. This dataset contains 9710 3n neutral insertions and deletions identified from UCSC mammalian alignments, chosen so that there is one random insertion or deletion per gene. A subset of 474 3n neutral insertions and deletions was used for method training and cross-validation.

    F1,      F2
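The 3n definition above reduces to a length test, which could be sketched as follows. This is an illustration only: the function name `classify_indel` and the VCF-style reference/alternate allele inputs are assumptions, since the format of the dataset files is not described here.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify an indel by the length difference between its alleles.

    ref and alt are assumed to be VCF-style reference and alternate
    allele strings (an assumption of this sketch, not the file format).
    Returns "3n" when the inserted or deleted length is divisible by 3,
    otherwise "frameshift", because such a change in coding sequence
    shifts the reading frame.
    """
    length = abs(len(alt) - len(ref))
    if length == 0:
        raise ValueError("alleles have equal length: not an insertion or deletion")
    return "3n" if length % 3 == 0 else "frameshift"


if __name__ == "__main__":
    print(classify_indel("A", "ACGT"))  # 3 bases inserted
    print(classify_indel("ACG", "A"))   # 2 bases deleted
```

Note that the test only looks at length; whether a 3n change produces a clean amino acid insertion/deletion or a block substitution depends on where it falls relative to codon boundaries.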

Reference: Hu J, Ng PC (2013) SIFT Indel: Predictions for the Functional Effects of Amino Acid Insertions/Deletions in Proteins. PLOS ONE 8(10): e77940. https://doi.org/10.1371/journal.pone.0077940  PUBMED  

DATASET 5

Dataset for MutPredIndel

F1 contains neutral training data, F2 contains pathogenic training data, F3 contains ASD variants.

    F1,      F2,     F3

Reference: Pagel K, Antaki D, Lian A, Mort M, Cooper D, Sebat J, Iakoucheva L, Mooney S, Radivojac P. Pathogenicity and functional impact of non-frameshifting insertion/deletion variation in the human genome. PLoS Comput Biol. 2019. doi: 10.1371/journal.pcbi.1007112.  PUBMED  

DATASET 6

Dataset for MutPredLof

Dataset for training a predictor of frameshifting insertions and deletions.

    F1,      F2

Reference: Pejaver, V., Urresti, J., Lugo-Martinez, J., Pagel, K.A., Lin, G.N., Nam, H.J., Mort, M., Cooper, D.N., Sebat, J., Iakoucheva, L.M., et al. (2020). Inferring the molecular and phenotypic impact of amino acid variants with MutPred2. Nat Commun 11, 5918.  PUBMED  


Last updated: 2021-09-03 by Niloofar Shirvanizadeh.