|Abstract||Large-scale clinical genomics studies consisting of thousands of whole genome sequences require a system that can analyze billions of unique variants across terabytes of data generated with different technologies by disparate laboratories. Modern big data and cloud technologies, rather than traditional tab-delimited file-based approaches, are advantageous and increasingly necessary for scalable, high-access genome analysis workflows and tools. Additionally, large collections of standardized annotations enable the evaluation and interpretation of genomic variation and improve the quality of analysis results. To this end, we have developed a cross-platform, Hadoop-based system for distributed storage and query of annotations and large collections of genome sequences.
The system has been designed to integrate whole genome sequences from Illumina and Complete Genomics, with an extensible model for integrating genomic data derived from multiple sequencing platforms. The genomic data model normalizes technical details such as variant call quality and no-calls; supports versioning and provenance of evolving annotation sources, genome builds, and sequences originating from different studies; and scales to handle increases in content with minimal impact on performance. Over twenty annotation sources have been standardized, normalized, and integrated for deployment on several cloud-based or in-house platforms. These annotation sources are joined with sequencing data to support a variety of in-database analysis workflows and ad hoc queries, rather than relying on static tab-delimited VCF files. Incremental and modular updates to the system are possible as new genome sequences are added to continuously growing collections and as annotation sources are modified. Data replication of sequences and annotations, as well as indexing and partitioning schemes, are employed to enhance performance for scenarios ranging from deep analysis of family pedigrees to association studies of large collections of genomes. Distributed queries can be coupled with parallel computation, for example with Python or R through industry-standard APIs (e.g., ODBC or GA4GH). As the use of genomics in the clinic grows, large-scale, distributed, normalized data systems such as the one outlined here will be imperative to continuously improve the accuracy and quality of personalized medicine. |
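
To make the idea of joining normalized variant calls with versioned annotations more concrete, the following is a minimal sketch only. It uses Python's built-in `sqlite3` module as a local stand-in for the distributed, Hadoop-based SQL engine, which the abstract does not specify. The table names, column names, sample data, and the helper functions `build_demo_db` and `annotated_variants` are all hypothetical and are not taken from the system described above. No-calls are assumed to be normalized to `NULL` genotypes.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical tables.

    Assumption: sqlite3 stands in for the distributed query engine;
    the schema below is illustrative, not the system's actual model.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE variant_calls (
            sample_id TEXT, platform TEXT,
            chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
            genotype TEXT,   -- NULL represents a normalized no-call
            quality REAL     -- platform-independent quality score
        );
        CREATE TABLE annotations (
            source TEXT, source_version TEXT,
            chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
            gene TEXT, significance TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO variant_calls VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            ("S1", "Illumina", "1", 1000, "A", "G", "0/1", 50.0),
            ("S2", "CompleteGenomics", "1", 1000, "A", "G", None, None),
            ("S1", "Illumina", "2", 2000, "C", "T", "1/1", 10.0),
            ("S2", "CompleteGenomics", "2", 2000, "C", "T", "0/1", 45.0),
        ],
    )
    conn.executemany(
        "INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [
            ("ExampleDB", "v1", "1", 1000, "A", "G", "GENE1", "pathogenic"),
            ("ExampleDB", "v2", "1", 1000, "A", "G", "GENE1", "likely_pathogenic"),
            ("ExampleDB", "v2", "2", 2000, "C", "T", "GENE2", "benign"),
        ],
    )
    return conn


def annotated_variants(conn, min_quality, source_version):
    """Join called variants with one version of an annotation source.

    No-calls (NULL genotype) and calls below min_quality are excluded.
    """
    query = """
        SELECT c.sample_id, c.chrom, c.pos, a.gene, a.significance
        FROM variant_calls AS c
        JOIN annotations AS a
          ON a.chrom = c.chrom AND a.pos = c.pos
         AND a.ref = c.ref AND a.alt = c.alt
        WHERE c.genotype IS NOT NULL
          AND c.quality >= ?
          AND a.source_version = ?
        ORDER BY c.sample_id, c.chrom, c.pos
    """
    return conn.execute(query, (min_quality, source_version)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in annotated_variants(demo, 30.0, "v2"):
        print(row)
```

Pinning the annotation version in the query is one simple way to keep an analysis reproducible while the annotation source evolves. In a real deployment, the same SQL would presumably be issued through an ODBC connection to the cluster rather than to a local file-less database.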