VariSNP

VariSNP is a benchmark database suite of variation datasets for developing and testing the performance of variant effect prediction tools. The datasets were selected from dbSNP and filtered to exclude disease-related variants found in ClinVar, Swiss-Prot and PhenCode, so all remaining variations are considered neutral or non-pathogenic.

The dataset columns are described below. Columns 1-23 come from dbSNP, columns 24-29 were generated with the Mutalyzer Name Checker tool (Mutalyzer), and columns 30-32 were generated with the VariOtator batch tool (VariOtator):

dbSNP_id: dbSNP RefSNP cluster ID number (rs#)
heterozygosity: Estimated average heterozygosity, computed from the allele frequencies of this RefSNP. Values range from 0 to 1. A document describing the computation of average heterozygosity and standard error for dbSNP RefSNP clusters is available at NCBI
heterozygosity_standard_error: Standard error of heterozygosity estimate. See column 2
creation_date: Date when the RefSNP cluster was instantiated
creation_build: Build (NCBI release) number when the RefSNP cluster was created
update_date: Most recent date the RefSNP cluster was updated (member added or deleted)
update_build: Build number (NCBI release) when the RefSNP cluster was updated
observed_alleles: Observed variation alleles, i.e. all alleles observed at this position in the reference, separated by slashes. Examples: A/C, A/C/G/T or -/ACC
asn_from: Start position of the SNP on the contig, counting from 0. This position is always counted from the beginning of the contig, regardless of the orientation of the SNP to the contig and of the contig to the chromosome
asn_to: End position of the SNP on the contig
reference_allele: Reference allele(s); this can be '-' in the case of an insertion
orientation: Orientation of RefSNP sequence to contig sequence. Values are 'forward' or 'reverse'
minor_allele_frequency: Global minor allele frequency (MAF). dbSNP reports the MAF for each rs in a default global population. Because the value is provided to distinguish common polymorphisms from rare variants, the reported MAF is the frequency of the second most frequent allele. For example, if there are 3 alleles with frequencies of 0.50, 0.49 and 0.01, the MAF is reported as 0.49. The current default global population is the 1000 Genomes phase 1 genotype data from 1094 worldwide individuals, released in the May 2011 dataset. Values range from 0 to 0.50
minor_allele: Minor allele
sample_size: Sample size, which is the number of chromosomes in the sample population
validation: Validation method, type of evidence used to confirm the variation. Present values can be byHapMap; byOtherPop; byFrequency; by1000G; by2Hit2Allele; byCluster
hgvs_names: Description(s) of the variation according to HGVS recommendations
allele_origin: Genetic origin of the allele, e.g. germline, somatic, inherited, maternal
clinical_significance: Clinical significance. Assertions of clinical significance for alleles of human sequence variations are reported as provided by the submitter and not interpreted by NCBI. Submissions based on processing data from OMIM® were assigned the value of ‘probable-pathogenic’. If there is a published authoritative guideline about the pathogenicity of any allele, that is included in the report. The supported values are: unknown, untested, non-pathogenic, probable-non-pathogenic, probable-pathogenic, pathogenic, drug-response, histocompatibility, other
functional_class: Variation functional class. Variations are assigned functional classes that report whether a variation is located in a locus region, in a transcript, or in a coding region. This column contains one or more functional classes (fxnClass). Possible values are cds-indel, downstream-variant-500B, frameshift-variant, intron-variant, missense, nc-transcript-variant, reference, splice-acceptor-variant, splice-donor-variant, stop-gained, stop-lost, synonymous-codon, upstream-variant-2KB, utr-variant-3-prime. The column can also contain the Sequence Ontology term corresponding to the functional class (soTerm), the mRNA accession (mrnaAcc) and version (mrnaVer), the gene symbol (symbol) and the Entrez Gene ID (geneid)
ncbi_gi: NCBI gi number
ncbi_accession: NCBI accession and version number of reference sequence, e.g. NG_01234.5
gene_symbol: Gene symbol (provided by HGNC)
refseq_start_description: Description relative to transcription start on reference sequence
coding_dna_description: Coding DNA variant description according to HGVS recommendations
protein_description: Protein variant description according to HGVS recommendations
coding_reference: NCBI RefSeq accession and version number (mRNA), e.g. NM_01234.5
protein_reference: NCBI RefSeq accession and version number (protein), e.g. NP_01234.5
predicted_RNA_variation: Predicted RNA variant description according to HGVS recommendations (without reference)
DNA_annotation: Variation Ontology VariO annotation on DNA level
RNA_annotation: Variation Ontology VariO annotation on RNA level
protein_annotation: Variation Ontology VariO annotation on protein level
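
The column layout above can be sketched in Python as a quick way to look up which tool produced a given column and to interpret two of the trickier fields. This is only an illustration: the helper names (column_source, split_observed_alleles, reported_maf) are hypothetical and not part of VariSNP, and the sketch makes no assumption about the file delimiter or about how the files are distributed.

```python
# Column names in the order listed above (1-based positions 1..32).
COLUMNS = [
    "dbSNP_id", "heterozygosity", "heterozygosity_standard_error",
    "creation_date", "creation_build", "update_date", "update_build",
    "observed_alleles", "asn_from", "asn_to", "reference_allele",
    "orientation", "minor_allele_frequency", "minor_allele", "sample_size",
    "validation", "hgvs_names", "allele_origin", "clinical_significance",
    "functional_class", "ncbi_gi", "ncbi_accession", "gene_symbol",
    "refseq_start_description", "coding_dna_description",
    "protein_description", "coding_reference", "protein_reference",
    "predicted_RNA_variation",
    "DNA_annotation", "RNA_annotation", "protein_annotation",
]


def column_source(position):
    """Return the source of a column given its 1-based position."""
    if 1 <= position <= 23:
        return "dbSNP"
    if 24 <= position <= 29:
        return "Mutalyzer"
    if 30 <= position <= 32:
        return "VariOtator"
    raise ValueError(f"column position out of range: {position}")


def split_observed_alleles(value):
    """Split an observed_alleles string such as 'A/C' or '-/ACC' on '/'."""
    return value.split("/")


def reported_maf(frequencies):
    """Return the second highest allele frequency, which is how the
    minor_allele_frequency column is described above."""
    ordered = sorted(frequencies, reverse=True)
    if len(ordered) < 2:
        raise ValueError("at least two allele frequencies are required")
    return ordered[1]
```

For instance, column_source(COLUMNS.index("protein_description") + 1) identifies the protein description as a Mutalyzer column, and reported_maf([0.50, 0.49, 0.01]) reproduces the 0.49 from the example in the minor_allele_frequency description.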

A benchmark database for neutral variations from dbSNP