Jinglan Dai, Yixin Zhang, Yuan Gao, Hongru Li, Sha Du, Hao Hong, Dongfang You, Zaiming Li, Ruyang Zhang, Yang Zhao, Zhonghua Liu, David C Christiani, Feng Chen, Sipeng Shen
Genomics, Proteomics & Bioinformatics. Published 2025-09-17. DOI: https://doi.org/10.1093/gpbjnl/qzaf084
Boosting the Power of Rare Variant Association Studies by Imputation Using Large-scale Sequencing Population.
With the emergence of population-scale whole-genome sequencing (WGS), rare variants can be captured precisely. Rare variants explain part of the heritability of complex traits that conventional genome-wide association studies (GWASs) miss. However, it remains unclear how closely the power of rare variant association studies based on imputed data approximates, or exceeds, that of studies based on WGS. Using WGS (n = 150,119) as the ground truth, we first evaluated the consistency of rare variants imputed from single nucleotide polymorphism (SNP) array data with the TOPMed or HRC+UK10K reference panels in the UK Biobank. The imputation quality (average R-square) of the TOPMed-imputed data reached 0.6 even for extremely rare variants with minor allele count ≤ 5. TOPMed-imputed data were closer to WGS across three ethnicities, with an average Cramér's V > 0.75. Furthermore, association tests were performed on 45 traits. At the same sample size, neither of the two imputed datasets outperformed WGS, but the results from TOPMed-imputed data were more consistent with WGS. When the sample size increased to n = 488,377, the number of rare variants identified in TOPMed-imputed data increased by 27.71% for quantitative traits and approximately 10-fold for binary traits. Finally, we meta-analyzed the SNP array and WGS association results for lung cancer and for epithelial ovarian cancer. The meta-analysis identified more variants and genes than the WGS-based analysis alone. Our findings highlight that incorporating rare variants imputed from large-scale sequencing populations can boost the power of rare variant association tests when WGS sample sizes are limited.
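The two concordance measures named in the abstract can be illustrated with a minimal sketch. It assumes that R-square is taken as the squared Pearson correlation between imputed dosages and sequenced genotypes, and that Cramér's V is computed on hard-called genotype categories (0/1/2); the abstract does not state how either was computed, so the function names and the toy data here are hypothetical.

```python
import math
from collections import Counter


def dosage_r2(imputed, truth):
    """Squared Pearson correlation between imputed dosages and true genotypes.

    Assumption: this empirical definition stands in for the study's R-square.
    """
    n = len(imputed)
    mean_i = sum(imputed) / n
    mean_t = sum(truth) / n
    cov = sum((a - mean_i) * (b - mean_t) for a, b in zip(imputed, truth))
    var_i = sum((a - mean_i) ** 2 for a in imputed)
    var_t = sum((b - mean_t) ** 2 for b in truth)
    if var_i == 0 or var_t == 0:
        return float("nan")  # undefined for a monomorphic vector
    return cov * cov / (var_i * var_t)


def cramers_v(x, y):
    """Cramér's V between two categorical vectors, e.g. genotype calls 0/1/2."""
    n = len(x)
    joint = Counter(zip(x, y))  # observed counts per (x, y) cell
    row = Counter(x)            # marginal counts of x
    col = Counter(y)            # marginal counts of y
    k = min(len(row), len(col)) - 1
    if k == 0:
        return float("nan")  # undefined when one vector has a single category
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Toy data for one rare variant in ten samples (hypothetical values).
wgs_genotypes = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
imputed_dosages = [0.0, 0.1, 0.0, 0.0, 0.2, 0.9, 0.0, 0.0, 0.6, 0.0]
imputed_calls = [round(d) for d in imputed_dosages]

print("R-square:", dosage_r2(imputed_dosages, wgs_genotypes))
print("Cramér's V:", cramers_v(imputed_calls, wgs_genotypes))
```

Both measures equal 1 when the imputed data reproduce the sequenced genotypes exactly and fall toward 0 as the two become unrelated, which is the sense in which higher values indicate imputed data closer to WGS.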