The Great Genotyper: a graph-based method for population genotyping of small and structural variants.

IF 11.8; Zone 2 (Biology); JCR Q1, MULTIDISCIPLINARY SCIENCES

GigaScience; Pub Date: 2025-01-06; DOI: 10.1093/gigascience/giaf112

Moustafa Shokrof, Mohamed Abuelanin, C Titus Brown, Tamer A Mansour

Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12491952/pdf/

Citations: 0

Abstract

Background: Long-read sequencing (LRS) enables high-quality structural variant (SV) discovery. SV genotypers utilize these precise call sets to improve the recall and precision of genotyping in short-read sequencing (SRS) samples. With the extensive growth in publicly available SRS datasets, it is now possible to calculate accurate population allele frequencies of SVs. However, reprocessing hundreds of terabytes of raw SRS data to genotype new variants is impractical for population-scale studies, a computational challenge known as the N+1 problem (i.e., the challenge of re-genotyping an entire cohort for one additional variant). Overcoming this computational bottleneck is essential for analyzing new SVs from the growing number of pangenomes, public genomic databases, and pathogenic variant discovery studies.

Results: We propose the Great Genotyper, a population-scale genotyping workflow to address the N+1 problem. Applied to a human dataset, the workflow begins by preprocessing 4.2k short-read samples of a total of 183 TB raw data to create an 867-GB Counting Colored de Bruijn Graph (CCDG). The Great Genotyper uses this CCDG to genotype a list of phased or unphased variants, leveraging the CCDG population information to increase both precision and recall. The Great Genotyper offers the same accuracy as the state-of-the-art genotypers while achieving unprecedented performance. It took about 100 hours to genotype 4.5M variants across the 4.2k samples and calculate their population allele frequencies using 1 server with 32 cores and 145 GB of memory. The Great Genotyper opens the door to new ways to study SVs. For example, using the premade index, we demonstrate the Great Genotyper's application in finding pathogenic variants by calculating accurate allele frequency for novel SVs. Also, we used it to create a 4k reference panel by genotyping variants from the Human Pangenome Reference Consortium (HPRC). The new reference panel allows for SV imputation from genotyping microarrays. Moreover, we genotype the human GWAS Catalog and merge its variants with the 4k reference panel. We show 6,253 events of high linkage between the HPRC's SVs and nearby GWAS single-nucleotide polymorphisms, which can help in interpreting the effect of these SVs on gene functions. This analysis uncovers the detailed haplotype structure of the human fibrinogen locus and revives the pathogenic association of a 28-bp insertion in the FGA gene with thromboembolic disorders.
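
To make the indexing idea concrete, here is a minimal sketch of why a precomputed per-sample k-mer count index avoids re-reading raw reads when a new variant arrives. It is an illustration only, not the Great Genotyper's implementation: the toy k-mer length, the sequences, and every name (build_index, support, call_genotype, allele_frequency) are assumptions made for this example, and the genotype call is a naive presence/absence rule rather than the population-aware model the abstract describes.

```python
from collections import Counter, defaultdict

K = 5  # toy k-mer length, chosen only so the example stays readable


def kmers(seq, k=K):
    """Return every overlapping k-mer of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(samples, k=K):
    """Map each k-mer to a Counter of per-sample occurrence counts.

    This dict is a toy stand-in for a counting colored de Bruijn graph:
    the "color" is the sample name and the value is the k-mer count.
    """
    index = defaultdict(Counter)
    for name, reads in samples.items():
        for read in reads:
            for km in kmers(read, k):
                index[km][name] += 1
    return index


def support(index, allele, other, sample, k=K):
    """Total count, in one sample, of k-mers found in allele but not in other."""
    unique = set(kmers(allele, k)) - set(kmers(other, k))
    return sum(index.get(km, {}).get(sample, 0) for km in unique)


def call_genotype(index, ref, alt, sample):
    """Naive call: number of ALT copies (0, 1, 2), or None without evidence."""
    r = support(index, ref, alt, sample)
    a = support(index, alt, ref, sample)
    if r == 0 and a == 0:
        return None
    if a == 0:
        return 0
    if r == 0:
        return 2
    return 1


def allele_frequency(calls):
    """ALT allele frequency over diploid samples that received a call."""
    called = [g for g in calls.values() if g is not None]
    return sum(called) / (2 * len(called)) if called else None


# Hypothetical reference and alternative alleles differing by one base.
REF = "TTGACCATGGCA"
ALT = "TTGACCTTGGCA"
samples = {"s1": [REF, REF], "s2": [REF, ALT], "s3": [ALT, ALT]}

index = build_index(samples)  # built once from the reads
# Genotyping a new variant only queries the index; the reads are not re-read.
calls = {s: call_genotype(index, REF, ALT, s) for s in samples}
af = allele_frequency(calls)
print(calls, af)
```

In this toy cohort the three samples are called as 0, 1, and 2 ALT copies, giving an allele frequency of 0.5. The point of the design is that build_index runs once per cohort, while each additional variant costs only a handful of lookups per sample.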

Conclusion: The Great Genotyper solves the N+1 problem for population-scale genotyping of small and structural variants, offering both high accuracy and efficiency. Its ability to rapidly re-genotype large cohorts paves the road for several new studies of SVs.

Source journal

GigaScience (MULTIDISCIPLINARY SCIENCES)

CiteScore: 15.50

Self-citation rate: 1.10%

Articles published: 119

Review time: 1 week

About the journal: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.