Yuzhuo Ma, He Xu, Ying Li, Hyesung Kim, Lin-lin Xu, Lin Miao, Peng Xu, Fengbiao Mao, Xu-jie Zhou, Wei Zhou, Seunggeun Lee, Ji-Feng Zhang, Peipei Zhang, Wenjian Bi
{"title":"SPAmix: a scalable, accurate, and universal analysis framework for large-scale genetic association studies in admixed populations","authors":"Yuzhuo Ma, He Xu, Ying Li, Hyesung Kim, Lin-lin Xu, Lin Miao, Peng Xu, Fengbiao Mao, Xu-jie Zhou, Wei Zhou, Seunggeun Lee, Ji-Feng Zhang, Peipei Zhang, Wenjian Bi","doi":"10.1186/s13059-025-03827-9","DOIUrl":null,"url":null,"abstract":"Inclusion of individuals with diverse or admixed genetic ancestries is crucial to discover novel findings that may be missed by genomics analyses rooted solely in European population. Here, we present an analysis framework, SPAmix, which is scalable to a large-scale biobank data analysis including hundreds of thousands of admixed individuals and is universally applicable to various types of complex traits including quantitative traits, time-to-event traits, ordinal traits, and longitudinal traits. Since no alternative model is fitted, SPAmix primarily focuses on association p values. For each genetic variant, SPAmix uses genotype data and genetic principal components to estimate individual-specific allele frequency, which is subsequently used to calibrate p values via a retrospective analysis. A hybrid strategy including saddlepoint approximation (SPA) can greatly increase the accuracy to analyze rare genetic variants, especially if the phenotypic distribution is unbalanced or extremely unbalanced. We also propose SPAmixlocal to incorporate local ancestry to calculate ancestry-specific p values. To maximize the statistical powers, SPAmixCCT is proposed to combine the p values of SPAmix and SPAmixlocal via Cauchy combination. The SPAmix-based approaches are more accurate than Tractor to address phenotypic variance heterogeneity among ancestries when analyzing quantitative traits and to address an unbalanced case–control ratio when analyzing binary traits. SPAmixCCT is an optimal unified approach for various cross-ancestry genetic architectures. Extensive simulation studies and real data analyses of 369,314 UK Biobank individuals from multiple ancestries demonstrated that SPAmix is scalable and can discover novel hits while controlling type I error rates well.","PeriodicalId":12611,"journal":{"name":"Genome Biology","volume":"11 1","pages":""},"PeriodicalIF":10.1000,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genome Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13059-025-03827-9","RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Inclusion of individuals with diverse or admixed genetic ancestries is crucial to discover novel findings that may be missed by genomics analyses rooted solely in European population. Here, we present an analysis framework, SPAmix, which is scalable to a large-scale biobank data analysis including hundreds of thousands of admixed individuals and is universally applicable to various types of complex traits including quantitative traits, time-to-event traits, ordinal traits, and longitudinal traits. Since no alternative model is fitted, SPAmix primarily focuses on association p values. For each genetic variant, SPAmix uses genotype data and genetic principal components to estimate individual-specific allele frequency, which is subsequently used to calibrate p values via a retrospective analysis. A hybrid strategy including saddlepoint approximation (SPA) can greatly increase the accuracy to analyze rare genetic variants, especially if the phenotypic distribution is unbalanced or extremely unbalanced. We also propose SPAmixlocal to incorporate local ancestry to calculate ancestry-specific p values. To maximize the statistical powers, SPAmixCCT is proposed to combine the p values of SPAmix and SPAmixlocal via Cauchy combination. The SPAmix-based approaches are more accurate than Tractor to address phenotypic variance heterogeneity among ancestries when analyzing quantitative traits and to address an unbalanced case–control ratio when analyzing binary traits. SPAmixCCT is an optimal unified approach for various cross-ancestry genetic architectures. Extensive simulation studies and real data analyses of 369,314 UK Biobank individuals from multiple ancestries demonstrated that SPAmix is scalable and can discover novel hits while controlling type I error rates well.
Genome BiologyBiochemistry, Genetics and Molecular Biology-Genetics
CiteScore
21.00
自引率
3.30%
发文量
241
审稿时长
2 months
期刊介绍:
Genome Biology stands as a premier platform for exceptional research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens.
With an impressive impact factor of 12.3 (2022),* the journal secures its position as the 3rd-ranked research journal in the Genetics and Heredity category and the 2nd-ranked research journal in the Biotechnology and Applied Microbiology category by Thomson Reuters. Notably, Genome Biology holds the distinction of being the highest-ranked open-access journal in this category.
Our dedicated team of highly trained in-house Editors collaborates closely with our esteemed Editorial Board of international experts, ensuring the journal remains on the forefront of scientific advances and community standards. Regular engagement with researchers at conferences and institute visits underscores our commitment to staying abreast of the latest developments in the field.