BDQC: a general-purpose analytics validation tool for Big Data Discovery Science.

Title: BDQC: a general-purpose analytics validation tool for Big Data Discovery Science.
Publication Type: Conference Paper
Year of Publication: 2015
Authors: Glusman G, Kramer R, Deutsch EW, Foster I, Kesselman C, Madduri R, Chard K, Heavner BD, Dinov ID, Ames J, Van Horn J, Price ND, Hood LE, Toga AW
Conference Name: American Society for Human Genetics
Date Published: 10/2015
Type of Work: Abstract
Abstract: Biomedical data acquisition is generating exponentially more data: thousands of whole-genome sequences (WGS) are now available, and brain data is doubling every two years. Analyses of Big Data, genomic or otherwise, present qualitatively new challenges as well as opportunities. Among the challenges is a proliferation of ways analyses can fail, due largely to the increasing length and complexity of processing pipelines. Anomalies in input data, runtime resource exhaustion, or unavailability of nodes in a distributed computation can all cause pipeline hiccups that are not necessarily obvious in the output. Flaws that can taint results may persist undetected in complex pipelines, a danger amplified by the fact that research is often concurrent with the development of the software on which it depends. On the positive side, the huge sample sizes increase statistical power, which in turn can motivate entirely new analyses.

We have developed a framework for Big Data Quality Control (BDQC): an extensible set of analyses, heuristic and statistical, that identify deviations in data without regard to its meaning (domain-blind analyses). BDQC takes advantage of large sample sizes to classify the samples, estimate distributions, and identify outliers. Such outliers may be symptoms of technology failure (e.g., truncated output of one step of a pipeline for a single genome) or may reveal unsuspected “signal” in the data (e.g., evidence of aneuploidy in a genome).

We have applied the framework to validate our WGS analysis pipelines. BDQC successfully identified data outliers representing genome analyses missing a whole chromosome or part thereof, hidden among thousands of intermediary output files. These failures could then be resolved by reanalyzing the affected samples.
BDQC both identified hidden flaws (in some cases, in software deemed “too simple to fail”) and yielded unlooked-for insights into the data itself. BDQC is meant to complement quality software development practices. Applying BDQC at all pipeline stages offers multiple benefits. By checking inputs, it can help avoid expensive computations on flawed data. Analysis of intermediary and final results facilitates recovery from aberrant termination of processes. All of these computationally inexpensive verifications reduce cryptic analytical artifacts that could otherwise preclude clinical-grade genome interpretation. This Big Data for Discovery Science work is supported by NIH 1U54EB020406.
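The core idea described above — estimate a distribution over a summary statistic computed across many samples, then flag samples that deviate from it, without interpreting the data's meaning — can be sketched as follows. This is a minimal illustration, not BDQC's actual implementation: the function name, the use of file sizes as the statistic, and the modified z-score threshold of 3.5 are all assumptions chosen for the example.

```python
import statistics

def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score (based on the
    median and the median absolute deviation) exceeds the threshold.
    A domain-blind check: it never interprets what the values mean."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: nearly all values identical; flag any deviation.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical per-sample output file sizes (in MB); one truncated
# file, such as an analysis missing a chromosome, stands out.
sizes = [1002, 998, 1005, 1001, 997, 412, 1003, 999]
print(flag_outliers(sizes))  # → [5]
```

In practice a framework like BDQC would apply many such statistics (file sizes, record counts, per-field distributions) across thousands of intermediary files, so that a single truncated output surfaces automatically rather than by manual inspection.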