|Abstract||Kaviar, first released in 2010, is now the largest publicly available catalog of whole-genome human allele frequencies. Obtaining established allele frequencies for observed genetic variants is a key step in genome interpretation, and the accuracy of those frequencies depends on the size and diversity of the source data. Kaviar combines data from 37 sources, including more than 5,000 whole genomes sequenced by the Inova Translational Medicine Institute (inova.org/itmi) and the Institute for Systems Biology (familygenomics.systemsbiology.net). It provides allele frequencies for 168 million SNVs, twice as many as the 1000 Genomes Project. In addition to the private whole genomes, Kaviar includes whole-genome data from the 1000 Genomes Project, the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Simons Diversity Project, the Wellderly project, the UK10K, Genomes of the Netherlands, and the Personal Genomes Project. It also incorporates the 63,000 exomes from the Exome Aggregation Consortium (ExAC).
Kaviar uses a comprehensive pipeline to integrate sources in a manner that increases concordance, including normalization of all variants to enforce parsimony and left alignment. Using 194 genomes sequenced on both the Illumina and Complete Genomics platforms, we remap variants that are called frequently on one platform but never on the other. Kaviar also reports population-specific frequencies computed by inferring the continent(s) of origin for each data source.
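To illustrate what normalization to a parsimonious, left-aligned representation involves, the following is a minimal Python sketch of the widely used trim-and-shift procedure (the same idea implemented by tools such as `bcftools norm`). It is not Kaviar's actual pipeline code: the function name `normalize_variant`, the 1-based coordinate convention, and the in-memory reference string are assumptions made for illustration.

```python
def normalize_variant(pos, ref, alt, reference):
    """Return (pos, ref, alt) in parsimonious, left-aligned form.

    Illustrative sketch only. `pos` is 1-based; `reference` is the
    chromosome sequence as a plain string (hypothetical input format).
    """
    if ref == alt:
        raise ValueError("ref and alt are identical; not a variant")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base (right-trim).
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    toy_reference = "GGGCACACACAGGG"
    # A CA deletion reported inside the repeat shifts to its leftmost position.
    print(normalize_variant(8, "CAC", "C", toy_reference))
    # A padded SNV is trimmed to a single base on each side.
    print(normalize_variant(4, "CAC", "CTC", toy_reference))
```

Applying one such rule to every source means the same deletion in a repeat is always recorded at the same position, so counts from different sources can be combined for a single variant.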
Kaviar now reports genotype frequencies in addition to allele frequencies, which can help identify genomic regions that require closer inspection. Variants that deviate from Hardy-Weinberg equilibrium may indicate errors in the reference genome, in sequencing, or in analysis, or may reflect population structure. Such deviations may also point toward signatures of selection. A variant never observed in the homozygous state may have a recessive effect. Differences in genotype distributions among data sources may reveal differences in sequencing technology or variant representation, or reference biases.
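As a rough illustration of how genotype counts can flag deviation from Hardy-Weinberg equilibrium, the sketch below runs a one-degree-of-freedom chi-square test on a biallelic site. It is not a description of how Kaviar itself tests for deviation: the function name `hwe_chi_square` and the example counts are hypothetical, and an exact test would usually be preferred when genotype counts are small.

```python
import math


def hwe_chi_square(n_ref_hom, n_het, n_alt_hom):
    """Chi-square test of Hardy-Weinberg equilibrium for a biallelic site.

    Illustrative sketch only. Returns (ref_allele_freq, chi2, p_value).
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotypes observed")

    # Reference allele frequency estimated from the genotype counts.
    p = (2 * n_ref_hom + n_het) / (2 * n)
    q = 1.0 - p

    observed = (n_ref_hom, n_het, n_alt_hom)
    # Expected counts under HWE: p^2, 2pq, q^2 times the sample size.
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    # Skip classes with zero expectation (monomorphic sites).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: no heterozygotes at all, a strong deviation.
    print(hwe_chi_square(50, 0, 50))
    # Hypothetical counts exactly matching HWE proportions.
    print(hwe_chi_square(25, 50, 25))
```

A site with a very small p-value is a candidate for the closer inspection described above, whether the cause turns out to be technical error, population structure, or selection.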
We propose Kaviar as the community’s standard resource for human allele and genotype frequencies. Kaviar is available via web interface, downloadable VCF files, and GA4GH beacon at db.systemsbiology.net/kaviar. |