A new population structure analysis approach specifically designed for whole genome sequence data

Title: A new population structure analysis approach specifically designed for whole genome sequence data
Publication Type: Conference Paper
Year of Publication: 2015
Authors: Robinson M, Wong WSW, Solomon BD, Vockley JG, Schmulevich I, Glusman G, Niederhuber JE
Conference Name: American Society of Human Genetics
Date Published: 10/2015
Type of Work: Abstract
Abstract: We developed a new approach to population structure analysis that recognizes differences between closely related populations better than standard methods involving principal component analysis (PCA) followed by k-means clustering. We find that PCA’s normalization overweights common variants. We developed a novel method, Scaled Singular Value Decomposition (SVD), that weights variants equally and takes full advantage of the rarer variants now observable by whole-genome sequencing (WGS). As a result, population structure within continents is clearly resolved. Scaled SVD also facilitates analysis of admixture by locating variants as well as samples on the same principal components. Furthermore, k-means clustering ignores the ordered nature of principal components and the hierarchical structure of populations. We thus developed abcTree, a top-down hierarchical clustering algorithm specifically intended for population structure identification. abcTree employs adaptive Bonferroni correction to resist over-interpreting population structure.
We evaluated our new methods (separately and in combination) on two large WGS cohorts: the 2504 samples (26 ascertained populations) of the 1000 Genomes Project, and an ethnically diverse cohort of 3483 founders from family trios in Northern Virginia, acquired by the Inova Translational Medicine Institute (ITMI, www.inova.org/itmi). The ITMI cohort is annotated with a rich set of multi-omic and phenotypic data, and metadata including self-reported country of birth. We analyzed multiple sets of variants, including common variants sampled from a genotyping array and variants of all frequencies randomly sampled from WGS data. In all combinations, Scaled SVD and abcTree separated samples better than PCA and k-means clustering. For the 1000 Genomes cohort, our method better recapitulated the annotated populations. For the ITMI cohort, the clusters corresponded better with self-reported country of birth.
Precise identification of a patient’s population of origin enables improved care via better diagnosis of inherited diseases and assessment of disease risks. It also increases power in genome-wide association studies (GWAS). We anticipate that these improvements in population structure analysis will be increasingly valuable as whole genome sequencing enters clinical practice.
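For readers unfamiliar with the baseline the abstract compares against, the following is a minimal Python sketch of "PCA followed by k-means clustering" on a genotype matrix. It is not the authors' Scaled SVD or abcTree, whose details the abstract does not give. Everything in it is an assumption made for illustration: the simulated two-population data, the function names, the choice to divide each centered variant by sqrt(p(1-p)) (a normalization widely used for genotype PCA), and the deterministic farthest-point initialization used to keep the small k-means routine reproducible.

```python
import numpy as np


def simulate_genotypes(rng, n_per_pop=50, n_variants=1000, drift=0.1):
    """Simulate 0/1/2 genotypes for two hypothetical populations."""
    # Shared ancestral allele frequencies (illustrative values only).
    base = rng.uniform(0.1, 0.9, size=n_variants)
    blocks = []
    for _ in range(2):
        # Each population drifts independently away from the ancestral frequencies.
        freqs = np.clip(base + rng.normal(0.0, drift, size=n_variants), 0.05, 0.95)
        blocks.append(rng.binomial(2, freqs, size=(n_per_pop, n_variants)))
    labels = np.repeat([0, 1], n_per_pop)
    return np.vstack(blocks).astype(float), labels


def pca_scores(genotypes, n_components=2):
    """Return sample scores on the top principal components."""
    p = genotypes.mean(axis=0) / 2.0           # estimated allele frequency per variant
    keep = (p > 0) & (p < 1)                   # drop monomorphic variants
    g, p = genotypes[:, keep], p[keep]
    z = (g - 2.0 * p) / np.sqrt(p * (1.0 - p))  # center, then scale each variant
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with farthest-point initialization (a simplification)."""
    centers = [points[0]]
    for _ in range(1, k):
        # Next center: the point farthest from all centers chosen so far.
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        new_centers = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign


def two_cluster_agreement(assign, labels):
    """Fraction of samples matching the true labels, up to swapping cluster names."""
    match = np.mean(assign == labels)
    return max(match, 1.0 - match)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    genotypes, labels = simulate_genotypes(rng)
    clusters = kmeans(pca_scores(genotypes), k=2)
    print("agreement with simulated populations:", two_cluster_agreement(clusters, labels))
```

Note that k-means here treats all retained components alike and returns a flat partition, which is the limitation the abstract cites as motivation for a hierarchical alternative.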