SPAmix: a scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations

IF 10.1; Zone 1 (Biology); JCR Q1: BIOTECHNOLOGY & APPLIED MICROBIOLOGY

Genome Biology, Pub Date: 2025-10-16, DOI: 10.1186/s13059-025-03827-9

Yuzhuo Ma, He Xu, Ying Li, Hyesung Kim, Lin-lin Xu, Lin Miao, Peng Xu, Fengbiao Mao, Xu-jie Zhou, Wei Zhou, Seunggeun Lee, Ji-Feng Zhang, Peipei Zhang, Wenjian Bi


Citations: 0

Abstract

Inclusion of individuals with diverse or admixed genetic ancestries is crucial for discovering novel findings that may be missed by genomic analyses rooted solely in European populations. Here, we present an analysis framework, SPAmix, which scales to large biobank analyses including hundreds of thousands of admixed individuals and is applicable to various types of complex traits, including quantitative, time-to-event, ordinal, and longitudinal traits. Because no alternative model is fitted, SPAmix primarily focuses on association p values. For each genetic variant, SPAmix uses genotype data and genetic principal components to estimate individual-specific allele frequencies, which are subsequently used to calibrate p values via a retrospective analysis. A hybrid strategy including saddlepoint approximation (SPA) greatly increases accuracy when analyzing rare genetic variants, especially if the phenotypic distribution is unbalanced or extremely unbalanced. We also propose SPAmixlocal, which incorporates local ancestry to calculate ancestry-specific p values. To maximize statistical power, SPAmixCCT combines the p values of SPAmix and SPAmixlocal via the Cauchy combination. The SPAmix-based approaches are more accurate than Tractor in addressing phenotypic variance heterogeneity among ancestries when analyzing quantitative traits and in addressing an unbalanced case–control ratio when analyzing binary traits. SPAmixCCT is an optimal unified approach for various cross-ancestry genetic architectures. Extensive simulation studies and real data analyses of 369,314 UK Biobank individuals from multiple ancestries demonstrated that SPAmix is scalable and can discover novel hits while controlling type I error rates well.
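
The Cauchy combination step mentioned in the abstract can be illustrated with a minimal sketch. It implements the standard Cauchy combination test: each p value is mapped to a Cauchy-distributed score, the scores are averaged with weights, and the average is converted back to a p value. The function name `cauchy_combine` and the equal default weights are assumptions made for illustration; the abstract does not specify how SPAmixCCT weights the SPAmix and SPAmixlocal p values, and this is not the authors' implementation.

```python
import math


def cauchy_combine(p_values, weights=None):
    """Combine p values with the standard Cauchy combination test.

    Hypothetical helper for illustration; equal weights are an assumption.
    """
    if not p_values:
        raise ValueError("need at least one p value")
    if weights is None:
        weights = [1.0] * len(p_values)
    if len(weights) != len(p_values):
        raise ValueError("weights and p_values must have the same length")

    total = float(sum(weights))
    stat = 0.0
    for p, w in zip(p_values, weights):
        if not 0.0 < p <= 1.0:
            raise ValueError("p values must lie in (0, 1]")
        if p < 1e-15:
            # For tiny p, tan((0.5 - p) * pi) is approximately 1 / (p * pi).
            score = 1.0 / (p * math.pi)
        else:
            score = math.tan((0.5 - p) * math.pi)
        stat += (w / total) * score

    if stat > 1e15:
        # Tail approximation of the Cauchy survival function for huge statistics.
        return 1.0 / (stat * math.pi)
    # Survival function of the standard Cauchy distribution.
    return 0.5 - math.atan(stat) / math.pi


# Example: one global p value and two ancestry-specific p values (made-up numbers).
combined = cauchy_combine([1e-8, 0.5, 0.2])
```

A property of this combination is that the result is dominated by the smallest input p values, so a strong signal in a single test (for example, one ancestry-specific test) is largely preserved in the combined p value.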


Source journal

Genome Biology (Biochemistry, Genetics and Molecular Biology: Genetics)

CiteScore: 21.00

Self-citation rate: 3.30%

Articles published: 241

Review time: 2 months

About the journal: Genome Biology stands as a premier platform for exceptional research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens. With an impact factor of 12.3 (2022), the journal is the 3rd-ranked research journal in the Genetics and Heredity category and the 2nd-ranked research journal in the Biotechnology and Applied Microbiology category by Thomson Reuters. Notably, Genome Biology holds the distinction of being the highest-ranked open-access journal in this category. Our dedicated team of highly trained in-house Editors collaborates closely with our Editorial Board of international experts, ensuring the journal remains at the forefront of scientific advances and community standards. Regular engagement with researchers at conferences and institute visits underscores our commitment to staying abreast of the latest developments in the field.