VariBench

A benchmark database for variations




Protein stability

A. Single variants

These datasets are subsets of ProTherm.

Dataset 1

1784 variations from 80 proteins with experimentally determined ΔΔG values in ProTherm. 1154 positive cases, of which 931 are destabilizing (ΔΔG ≥ 0.5 kcal/mol) and 222 are stabilizing (ΔΔG ≤ -0.5 kcal/mol), and 631 neutral cases (-0.5 kcal/mol < ΔΔG < 0.5 kcal/mol).
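The three classes can be assigned from a ΔΔG value with a simple threshold rule. The following is a minimal sketch, assuming the sign convention in which destabilizing variants have ΔΔG ≥ 0.5 kcal/mol and stabilizing variants have ΔΔG ≤ -0.5 kcal/mol; the function name and the example values are hypothetical and are not taken from the dataset file.

```python
from collections import Counter

THRESHOLD = 0.5  # kcal/mol, cutoff quoted in the dataset description


def classify_ddg(ddg: float, threshold: float = THRESHOLD) -> str:
    """Assign a stability class from a ΔΔG value in kcal/mol.

    Assumes positive ΔΔG means destabilizing; flip the sign first if
    a file uses the opposite convention.
    """
    if ddg >= threshold:
        return "destabilizing"
    if ddg <= -threshold:
        return "stabilizing"
    return "neutral"


# Hypothetical ΔΔG values, not taken from the dataset.
example_ddg = [1.2, -0.8, 0.1, 0.5, -0.5, -0.49]
class_counts = Counter(classify_ddg(value) for value in example_ddg)
print(class_counts)
```

Values exactly at the cutoffs are treated here as non-neutral, which is one possible reading of the thresholds.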

F
Reference:
Khan S, Vihinen M. Performance of protein stability predictors. Hum Mutat. 2010, 31(6):675-684.   PUBMED  

Dataset 2

2156 variations combined from a list of 964 single variations (Guerois et al. 2002) and a set of 2972 single variations from ProTherm, after filtering for duplicate entries. NMR-determined structures were excluded, and only the average ΔΔG value was given when several ΔΔG values were present for a single variation.
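The averaging step described above can be sketched as follows. This is an illustration only: the record layout (protein ID, variation, ΔΔG) and the values are hypothetical, and the original authors' exact procedure may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (protein ID, variation, ΔΔG in kcal/mol).
measurements = [
    ("P1", "A10G", 1.0),
    ("P1", "A10G", 2.0),
    ("P1", "L45V", -0.4),
    ("P2", "A10G", 0.6),
]


def average_ddg(records):
    """Collapse repeated measurements to one mean ΔΔG per (protein, variation)."""
    grouped = defaultdict(list)
    for protein, variation, ddg in records:
        grouped[(protein, variation)].append(ddg)
    return {key: mean(values) for key, values in grouped.items()}


averaged = average_ddg(measurements)
for (protein, variation), ddg in averaged.items():
    print(protein, variation, ddg)
```

Keying on both the protein and the variation keeps the same substitution in different proteins separate.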

F

Reference: Potapov V, Cohen M, Schreiber G. Assessing computational methods for predicting protein stability upon mutation: good on average but not in the details. Protein Eng Des Sel. 2009, 22(9):553-560.   PUBMED  

Dataset 3

Training dataset of 339 experimentally studied variants in nine proteins and 625 variants from ProTherm.

  1. Training dataset: 339 variants from 9 proteins. Dataset 3(a)
  2. Blind test dataset: 625 variants from 28 proteins. Dataset 3(b)

Reference: Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol. 2002, 320(2):369-387.   PUBMED  

Dataset 4

S1615 was used for training/testing the neural network system. S388 was used as the test data and contains 388 variations collected only at physiological conditions. S388 is a subset of S1615. Only single variations with ΔΔG values in ProTherm and structures deposited in the PDB were included.

  1. Training dataset: S1615 - 1615 variants from 42 proteins. Dataset 4(a)
  2. Test dataset: S388 (subset of the first) - 388 variants from 17 proteins. Dataset 4(b)

Reference: Capriotti E, Fariselli P, Casadio R. A neural-network-based method for predicting protein stability changes upon single point mutations. Bioinformatics. 2004, 20 Suppl 1:i63-68.   PUBMED  

Dataset 5

Dataset used for PON-Tstab

The correctness and quality of each variant was checked manually. The dataset contains 1564 variations from 99 proteins.

    PON-Tstab dataset

Reference: Yang, Y., Urolagin, S., Niroula, A., Ding, X., Shen, B., & Vihinen, M. (2018). PON-tstab: Protein Variant Stability Predictor. Importance of Training Data Quality. International Journal of Molecular Sciences, 19(4), 1009. doi:10.3390/ijms19041009.   PUBMED  

Dataset 6

Datasets used for I-Mutant2.0.

  1. 2087 variants with sequence information I_Mutant2.0_S2087 dataset
  2. 1948 variants with 3D structures I_Mutant2.0_S1948 dataset

Reference: Capriotti, E.; Fariselli, P.; Casadio, R. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res 2005, 33:W306-W310.   PUBMED  

Dataset 7

Datasets used by Saraboji and coworkers.

  1. 1791 variations with PDB structure. Thermal denaturation method Saraboji_S1791 dataset
  2. 1396 variants with thermal denaturation Saraboji_S1396 dataset
  3. 2204 variants with chemical denaturation Saraboji_S2204 dataset

Reference: Saraboji, K.; Gromiha, M. M.; Ponnuswamy, M. N. Average assignment method for predicting the stability of protein mutants. Biopolymers 2006, 82:80-92 doi: 10.1002/bip.20462.   PUBMED  

Dataset 8

Dataset used for iPTREE-STAB

    1859 single variants in 64 proteins iPTREE-STAB_S1859 dataset

Reference: Huang, L. T.; Gromiha, M. M.; Ho, S. Y. iPTREE-STAB: interpretable decision tree based method for predicting protein stability changes upon mutations. Bioinformatics 2007, 23:1292-1293.   PUBMED  

Dataset 9

Datasets used for SVM-WIN31 and SVM-3D12

  1. 1681 substitutions in 58 proteins SVM-WIN31_SVM-3D12_S1681 dataset
  2. 1634 variants in 55 proteins, PDB structures available SVM-WIN31_SVM-3D12_S1634 dataset
  3. 499 additional variants from a later version of ProTherm SVM-WIN31_SVM-3D12_S499 dataset

Reference: Capriotti, E.; Fariselli, P.; Rossi, I.; Casadio, R. A three-state prediction of single point mutations on protein stability changes. BMC Bioinformatics 2008, 9 Suppl 2:S6. doi: 10.1186/1471-2105-9-S2-S6   PUBMED  

Dataset 10

Dataset used for PoPMuSiC-2.0

    2648 substitutions in 131 proteins PoPMuSiC-2.0_S2648 dataset

Reference: Dehouck, Y.; Grosfils, A.; Folch, B.; Gilis, D.; Bogaerts, P.; Rooman, M. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 2009, 25:2537-2543 doi: 10.1093/bioinformatics/btp445   PUBMED  

Dataset 11

Dataset used for sMMGB

    1109 variants SMMGB_1109 dataset

Reference: Zhang, Z.; Wang, L.; Gao, Y.; Zhang, J.; Zhenirovskyy, M.; Alexov, E. Predicting folding free energy changes upon single point mutations. Bioinformatics 2012, 28:664-671. doi: 10.1093/bioinformatics/bts005   PUBMED  

Dataset 12

Dataset used for M8 and M47

  1. 2760 variants in 75 proteins M47andM8_S2760 dataset
  2. 1810 variants in 71 proteins. Cases with ΔΔG between -0.5 and 0.5 kcal/mol excluded from S2760 M47andM8_S1810 dataset

Reference: Yang, Y.; Chen, B.; Tan, G.; Vihinen, M.; Shen, B. Structure-based prediction of the effects of a missense variant on protein stability. Amino Acids 2013, 44:847-855 doi: 10.1007/s00726-012-1407-7   PUBMED  

Dataset 13

Dataset used for EASE-MM

  1. 238 variants, subselection of I-Mutant2.0 EASE-MM_S238 dataset
  2. 1676 variants EASE-MM_S1676 dataset
  3. 543 variants in 55 proteins. Subset of the PoPMuSiC-2.0 dataset of 2648 variants, with <25% sequence identity to both S1676 and S238 EASE-MM_S543 dataset

Reference: Folkman, L.; Stantic, B.; Sattar, A. Feature-based multiple models improve classification of mutation-induced stability changes. BMC Genomics 2014, 15 Suppl 4:S6 doi: 10.1186/1471-2164-15-S4-S6   PUBMED  

Dataset 14

Dataset used for HoTMuSiC

    1626 variants in 90 proteins HotMuSiC_S1626 dataset

Reference: Pucci, F.; Bourgeas, R.; Rooman, M. Predicting protein thermal stability changes upon point mutations using statistical potentials: Introducing HoTMuSiC. Sci Rep 2016, 6:23257 doi: 10.1038/srep23257   PUBMED  

Dataset 15

Dataset used for SAAFEC

  1. 1262 variants in 49 proteins SAAFEC_S1262 dataset
  2. 983 variants in 42 proteins with 3D structures SAAFEC_S983 dataset

Reference: Getov, I.; Petukh, M.; Alexov, E. SAAFEC: Predicting the Effect of Single Point Mutations on Protein Folding Free Energy Using a Knowledge-Modified MM/PBSA Approach. Int J Mol Sci 2016, 17:512 doi: 10.3390/ijms1704051   PUBMED  

Dataset 16

Dataset used for STRUM

  1. 3421 variants, protein structures available STRUM_Q3421 dataset
  2. 306 variants in 32 proteins, sequence identity <60% to S2648 of PoPMuSiC STRUM_Q306 dataset

Reference: Quan, L.; Lv, Q.; Zhang, Y. STRUM: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics 2016, 32:2936-2946 doi: 10.1093/bioinformatics/btw361   PUBMED  

Dataset 17

Dataset used for a metapredictor

    605 variants in 60 proteins. Measurements at pH 5-9 and temperature 20-30 °C Broom_S605 dataset
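A selection by measurement conditions such as the one stated above can be sketched as a simple range filter. This is only an illustration: the `Measurement` fields and the example records are hypothetical, and treating the pH and temperature limits as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    # Hypothetical field names, not the real columns of the dataset file.
    variant: str
    ph: float
    temperature_c: float
    ddg: float  # kcal/mol


def within_conditions(m, ph_range=(5.0, 9.0), temp_range_c=(20.0, 30.0)):
    """True when pH and temperature both fall inside the inclusive ranges."""
    ph_ok = ph_range[0] <= m.ph <= ph_range[1]
    temp_ok = temp_range_c[0] <= m.temperature_c <= temp_range_c[1]
    return ph_ok and temp_ok


# Hypothetical measurements, not taken from the dataset.
measurements = [
    Measurement("A10G", 7.0, 25.0, 1.1),
    Measurement("L45V", 3.0, 25.0, 0.4),   # pH below the range
    Measurement("K80E", 7.5, 55.0, -0.9),  # temperature above the range
    Measurement("D12N", 9.0, 20.0, 0.2),   # on the boundaries, kept
]
selected = [m.variant for m in measurements if within_conditions(m)]
print(selected)
```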

Reference: Broom, A.; Jacobi, Z.; Trainor, K.; Meiering, E. M. Computational tools help improve protein stability but with a solubility tradeoff. J Biol Chem 2017, 292:14349-14361 doi: 10.1074/jbc.M117.784165   PUBMED  

Dataset 18

Dataset used for Automute

  1. 1962 variants from S2204 of Saraboji et al., obtained by removing cases that were missing from the PDB or had fewer than six nearest neighbours AUTOMUTE_S1962 dataset
  2. 1925 variants selected from S1948 (I-Mutant2.0) after filtering AUTOMUTE_S1925 dataset
  3. 1749 variants from S1791 of Saraboji et al., obtained by removing cases that were missing from the PDB or had fewer than six nearest neighbours AUTOMUTE_S1749 dataset

Reference: Masso, M.; Vaisman, II. Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis. Bioinformatics 2008, 24:2002-2009 doi: 10.1093/bioinformatics/btn353   PUBMED  

Dataset 19

Dataset for TP53 variants

    42 variants in TP53 protein 42_variations_in_P53 dataset

Reference: Pires, DE.; Ascher, DB.; Blundell, TL. mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 2014, 30:335-342 doi: 10.1093/bioinformatics/btt691   PUBMED  

Dataset 20

Dataset Ssym composed of 684 single-site variations inserted in 357 protein structures

    684 variants in 357 protein structures bty348_dataset

Reference: Pucci, F.; Bernaerts, KV.; Kwasigroch, JM.; Rooman, M. Quantification of biases in predictions of protein stability changes upon mutations. Bioinformatics 2018, bty348 doi: 10.1093/bioinformatics/bty348   PUBMED  

Dataset 21

F1 is an alanine-scanning mutagenesis dataset including 768 "hot spots", or amino acid side chains that are predicted to significantly destabilize the interface when altered to alanine. F2 is 2971 ProTherm single variations, F3 is 2154 variations from Potapov et al. [PMID:19561092], F4 is 1005 variations from Guerois et al. [PMID:12079393] and F5 is 380 variations from the Kortemme and Baker dataset.

    F1     F2     F3     F4     F5

References: Kortemme, T.; Kim, D.E.; Baker, D. Computational Alanine Scanning of Protein-Protein Interfaces. Science's STKE 10 Feb 2004: PL2.  PUBMED  

Tanja Kortemme, David Baker, A simple physical model for binding energy hot spots in protein–protein complexes, Proceedings of the National Academy of Sciences Oct 2002, 99 (22) 14116-14121; DOI: 10.1073/pnas.202485799.   PUBMED  

Dataset 22

The file contains a set of 1210 single mutations obtained from ProTherm.

    F

Reference: Kellogg, E. H., Leaver-Fay, A., & Baker, D. (2010). Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins, 79(3), 830-8.  PUBMED  

Dataset 23

Dataset for PreTherMut

Both single and multiple variants. M-dataset 3366 variants, 836 stability increasing, 2530 stability decreasing variants.

    F

Reference: Tian, J., Wu, N., Chu, X., Fan, Y., Predicting changes in protein thermostability brought about by single- or multi-site mutations, BMC Bioinformatics;11:370. doi: 10.1186/1471-2105-11-370.  PUBMED  

Dataset 24

Dataset for iStable

F1 contains the M3131 positive (stability-increasing) dataset and F2 contains the negative (stability-decreasing) dataset. F3 is a training dataset with 1311 variants and F4 is a training dataset with 1820 variants.

    F1    F2    F3    F4

Reference: Chen, C., Lin, J., Chu, Y., iStable: off-the-shelf predictor integration for predicting protein stability changes, BMC Bioinformatics;14 Suppl 2:S5. doi: 10.1186/1471-2105-14-S2-S5.  PUBMED  

Dataset 25

CAGI frataxin benchmark cases

F contains experimentally determined ΔΔG values (in kcal/mol).

    F

Reference: Strokach, A., Corbi-Verge, C., Kim, P.M., Predicting changes in protein stability caused by mutation using sequence-and structure-based methods in a CAGI5 blind challenge, Hum Mutat;40(9):1414-1423. doi: 10.1002/humu.23852.  PUBMED  

Dataset 26

Dataset for iStable2

F1 is a training dataset (S3568), F2 is a test set (S630)

    F1     F2

Reference: Chen, C.W., Lin, M.H., Liao, C.C., Chang, H.P., Chu, Y.W., iStable 2.0: Predicting protein thermal stability changes by integrating various characteristic modules, Comput Struct Biotechnol J;18:622-630. doi: 10.1016/j.csbj.2020.02.021.  PUBMED  

Dataset 27

Dataset for benchmarking study

1024 variants, 585 destabilizing, 168 slightly destabilizing, 103 slightly stabilizing, 147 stabilizing, 21 no effect

    F

Reference: Marabotti, A., Prete, E.D., Scafuri, B., Facchiano, A., Performance of Web tools for predicting changes in protein stability caused by mutations, BMC Bioinformatics;22(Suppl 7):345. doi: 10.1186/s12859-021-04238-w.  PUBMED  

Dataset 28

Dataset for Thermonet

The datasets consist of the Q3214 and Q1744 variants and their associated experimental ΔΔG values.

F1 contains Q3214 data, F2 contains Q3214 reverse variants, F3 contains Q3214 direct variants, F4 contains Q1744 data, F5 contains Q1744 reverse variants, F6 contains Q1744 direct variants

    F1    F2    F3    F4    F5    F6

Reference: Li, B., Yang, Y.T., Capra, J.A., Gerstein, M.B., Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks, PLoS Comput Biol. 2020 Nov 30;16(11):e1008291. doi: 10.1371/journal.pcbi.1008291.  PUBMED  

Dataset 29

Dataset for ACDC-NN, free energy change prediction

S2648 contains 2,648 manually curated variants with experimentally measured ΔΔG values.

Ssym provides variations on proteins whose wildtype and variant 3D structures are solved by X-ray crystallography. It contains 684 variations, and half of them are reverse variations
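A reverse variation swaps the wild-type and variant residues, and under the usual thermodynamic antisymmetry assumption its ΔΔG is the negative of the direct value. The sketch below illustrates that relationship; the single-letter notation (for example `A10G`), the function name and the example value are hypothetical and do not describe the format of the dataset files.

```python
import re

# Single-residue substitutions written as <wild type><position><variant>.
VARIANT_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def reverse_variant(variant: str, ddg: float):
    """Return the reverse variation and its ΔΔG, assuming ΔΔG(reverse) = -ΔΔG(direct)."""
    match = VARIANT_PATTERN.match(variant)
    if match is None:
        raise ValueError(f"unrecognised variant notation: {variant!r}")
    wild_type, position, mutant = match.groups()
    return f"{mutant}{position}{wild_type}", -ddg


# Hypothetical direct variation with ΔΔG in kcal/mol.
direct = ("A10G", 1.5)
reverse = reverse_variant(*direct)
print(direct, reverse)
```

Applying the function twice returns the original variation and value, which is a quick consistency check for paired direct/reverse entries.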

vb1423 variants

  1. s2648 (LR)       F0     F1     F2     F3     F4     F5     F6     F7     F8     F9
  2. s2648 (TS)       F0     F1     F2     F3     F4     F5     F6     F7     F8     F9
  3. s2648 (VL)       F0     F1     F2     F3     F4     F5     F6     F7     F8     F9
  4. ssym (TS-dir)   F0     F1     F2     F3     F4     F5     F6
  5. ssym (TS-inv)  F0     F1     F2     F3     F4     F5     F6     F7
  6. vb1423 (LR)    F0     F1     F2     F3     F4     F5     F6     F7     F8     F9
  7. vb1423 (TS)    F0     F1     F2     F3     F4     F5     F6     F7     F8     F9
  8. vb1423 (VL)    F0     F1     F2     F3     F4     F5     F6     F7     F8     F9

Reference: S Benevenuta, C Pancotti, P Fariselli, G Birolo and T Sanavia, An antisymmetric neural network to predict free energy changes in protein variants, 2021 J. Phys. D: Appl. Phys. 54 245403.  PUBMED  

Dataset 30

Dataset for a benchmark study. For the variants in the file, 19 experimental structures for the direct variants and 342 experimental structures for the reverse variants are known.

    F

Reference: C Pancotti, S Benevenuta, G Birolo, V Alberini, V Repetto, T Sanavia, E Capriotti, P Fariselli, Predicting protein stability changes upon single-point mutation: a thorough comparison of the available tools on a new dataset, Brief Bioinform. 2022 Mar 10;23(2):bbab555. doi: 10.1093/bib/bbab555.  PUBMED  

B. Double variants

These datasets contain cases with double variants.

Dataset 1

Dataset used for WET-STAB

    180 double variants in 27 proteins D180 dataset

Reference: Huang, LT.; Gromiha, MM. Reliable prediction of protein thermostability change upon double mutation from amino acid sequence. Bioinformatics 2009, 25:2181-2187 doi: 10.1093/bioinformatics/btp370   PUBMED  


Last updated: 2022-02-22 by Niloofar Shirvanizadeh.