Kelsey E Grinde, Brian L Browning, Alexander P Reiner, Timothy A Thornton, Sharon R Browning
PLoS Genetics, vol. 20, no. 12, e1011242. Published 2024-12-16. DOI: 10.1371/journal.pgen.1011242 (https://doi.org/10.1371/journal.pgen.1011242). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684764/pdf/
Adjusting for principal components can induce collider bias in genome-wide association studies.
Principal component analysis (PCA) is widely used to control for population structure in genome-wide association studies (GWAS). Top principal components (PCs) typically reflect population structure, but challenges arise in deciding how many PCs are needed and ensuring that PCs do not capture other artifacts such as regions with atypical linkage disequilibrium (LD). In response to the latter, many groups suggest performing LD pruning or excluding known high LD regions prior to PCA. However, these suggestions are not universally implemented and the implications for GWAS are not fully understood, especially in the context of admixed populations. In this paper, we investigate the impact of pre-processing and the number of PCs included in GWAS models in African American samples from the Women's Health Initiative SNP Health Association Resource and two Trans-Omics for Precision Medicine Whole Genome Sequencing Project contributing studies (Jackson Heart Study and Genetic Epidemiology of Chronic Obstructive Pulmonary Disease Study). In all three samples, we find the first PC is highly correlated with genome-wide ancestry whereas later PCs often capture local genomic features. The pattern of which, and how many, genetic variants are highly correlated with individual PCs differs from what has been observed in prior studies focused on European populations and leads to distinct downstream consequences: adjusting for such PCs yields biased effect size estimates and elevated rates of spurious associations due to the phenomenon of collider bias. Excluding high LD regions identified in previous studies does not resolve these issues. LD pruning proves more effective, but the optimal choice of thresholds varies across datasets. 
Altogether, our work highlights unique issues that arise when using PCA to control for ancestral heterogeneity in admixed populations and demonstrates the importance of careful pre-processing and diagnostics to ensure that PCs capturing multiple local genomic features are not included in GWAS models.
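The collider mechanism described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under invented assumptions, not the authors' analysis: the two variants, their allele frequency, the effect size, and the way the hypothetical PC loads on both loci are all made up for illustration. The tested variant has no effect on the trait, yet adjusting for a PC that depends on both it and an independent causal variant produces a nonzero effect estimate.

```python
import numpy as np


def ols_coef(X, y):
    """Least-squares coefficients for y ~ intercept + X (intercept first)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


rng = np.random.default_rng(0)
n = 20_000  # hypothetical sample size

# Two independent variants (think: different chromosomes), coded 0/1/2.
g_test = rng.binomial(2, 0.3, size=n).astype(float)    # variant being tested
g_causal = rng.binomial(2, 0.3, size=n).astype(float)  # truly causal variant

# Hypothetical PC that captures local features at BOTH loci.
# Because it is influenced by both variants, it acts as a collider.
pc = g_test + g_causal + rng.normal(scale=0.5, size=n)

# The trait depends only on the causal variant; g_test has zero true effect.
y = 0.5 * g_causal + rng.normal(size=n)

# Effect estimate for g_test without and with adjustment for the PC.
beta_unadj = ols_coef(g_test, y)[1]
beta_adj = ols_coef(np.column_stack([g_test, pc]), y)[1]

print(f"unadjusted estimate for g_test:  {beta_unadj:.3f}")
print(f"PC-adjusted estimate for g_test: {beta_adj:.3f}")
```

In this setup the unadjusted estimate should sit near zero, while the PC-adjusted estimate should be clearly negative: conditioning on the PC induces a dependence between the tested variant and the causal one. A PC that tracks only genome-wide ancestry would not behave this way in this toy model, which is why the abstract stresses checking what later PCs actually capture.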
About the journal:
PLOS Genetics is run by an international Editorial Board, headed by the Editors-in-Chief, Greg Barsh (HudsonAlpha Institute of Biotechnology, and Stanford University School of Medicine) and Greg Copenhaver (The University of North Carolina at Chapel Hill).
Articles published in PLOS Genetics are archived in PubMed Central and cited in PubMed.