hmmibd-rs: An enhanced hmmIBD implementation for parallelizable identity-by-descent detection from large-scale Plasmodium genomic data.

Bing Guo, Stephen F Schaffner, Aimee R Taylor, Timothy D O'Connor, Shannon Takala-Harrison
Journal: Research square. Publication date: 2025-07-02. DOI: 10.21203/rs.3.rs-7004070/v1 (https://doi.org/10.21203/rs.3.rs-7004070/v1)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12236896/pdf/

Abstract

Background: Identity-by-descent (IBD), which describes recent genetic co-ancestry between pairs of genomes, is a fundamental concept in population genomics. It has been used to estimate genetic relatedness, detect selection signals, and understand population demography. The IBD detection method hmmIBD demonstrates high accuracy in inferring IBD segments between haploid genomes, including Plasmodium falciparum, and is widely used in malaria genomic surveillance. However, the current single-threaded implementation of hmmIBD does not utilize the full capacity of multi-processor computers, making it difficult to apply to large data sets, and does not accommodate non-uniform recombination rates across the genome.

Methods: We developed an enhanced implementation of hmmIBD in the Rust programming language, named hmmibd-rs, which leverages multi-threaded computing to parallelize IBD inference over genome pairs and which supports optional, user-defined recombination rate maps for more accurate IBD detection and filtration from genomes with non-uniform recombination. We further streamlined large-scale IBD detection by incorporating auxiliary built-in functionalities to preprocess input directly from the standard binary variant call format (BCF) and filter IBD output to reduce disk usage.
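The parallelization described above relies on each genome pair being an independent unit of work. A minimal sketch of that task decomposition follows, in Python rather than Rust and not taken from hmmibd-rs itself. The sample names, the toy genotypes, and the `compare_pair` function are hypothetical, and the per-pair statistic (fraction of matching sites) is only a stand-in for the hidden Markov model inference.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Toy haploid genotypes (0 = reference allele, 1 = alternate allele).
# Hypothetical data, used only to illustrate the pairwise decomposition.
GENOMES = {
    "s1": [0, 1, 1, 0, 1, 0],
    "s2": [0, 1, 1, 0, 0, 0],
    "s3": [1, 0, 1, 1, 1, 0],
}


def compare_pair(pair):
    """Stand-in for per-pair inference: fraction of sites with matching alleles."""
    a, b = pair
    matches = sum(x == y for x, y in zip(GENOMES[a], GENOMES[b]))
    return a, b, matches / len(GENOMES[a])


def run_all_pairs(n_threads=4):
    # n samples give n * (n - 1) / 2 pairs, each processed independently.
    pairs = list(combinations(sorted(GENOMES), 2))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() returns results in input order, so the output is deterministic.
        return list(pool.map(compare_pair, pairs))


results = run_all_pairs()
for a, b, score in results:
    print(f"{a}\t{b}\t{score:.3f}")
```

In CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so this sketch shows only how the work is split across pairs, not the speedup reported for the Rust implementation.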

Results: Our new implementation speeds up IBD detection nearly linearly with the number of CPU threads used; using 128 threads shortens IBD detection time from 5.2 days to 1.3 hours for 220 million pairs of simulated Plasmodium falciparum-like chromosomes, an approximately 100x speedup over the single-threaded hmmIBD implementation. Incorporating non-uniform recombination rates in hmmibd-rs enhances the accuracy of IBD inference by mitigating the overestimation of IBD breakpoints in recombination cold spots and their underestimation in hot spots. It also improves IBD segment length filtration, reducing the false positive rate in recombination cold spots and the false negative rate in hot spots. When applied to empirical data sets, hmmibd-rs completes the detection of IBD from MalariaGEN Pf7 (n ≈ 10,000 monoclonal samples) within hours, enabling a single-day IBD analysis pipeline for large genomic data sets.
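The effect of a recombination map on segment length filtration can be illustrated with a small sketch. It is not the hmmibd-rs implementation: the map values, the segment coordinates, the 1.0 cM threshold, and the function names are all hypothetical, chosen so that one segment lies in a cold spot and one in a hot spot. Segment length in centimorgans is computed either from a single uniform rate or by linear interpolation on the map.

```python
from bisect import bisect_right

# Hypothetical genetic map: physical position (bp) -> cumulative position (cM).
# 100-200 kb is a cold spot (0.1 cM); 200-300 kb is a hot spot (3.0 cM).
MAP_BP = [0, 100_000, 200_000, 300_000]
MAP_CM = [0.0, 0.5, 0.6, 3.6]


def bp_to_cm(pos):
    """Linearly interpolate the cumulative genetic position within the map range."""
    i = min(max(bisect_right(MAP_BP, pos) - 1, 0), len(MAP_BP) - 2)
    frac = (pos - MAP_BP[i]) / (MAP_BP[i + 1] - MAP_BP[i])
    return MAP_CM[i] + frac * (MAP_CM[i + 1] - MAP_CM[i])


def uniform_cm(pos):
    """Genetic position under one genome-wide average rate (same total length)."""
    return pos * MAP_CM[-1] / MAP_BP[-1]


def keep_segments(segments, to_cm, min_cm):
    """Keep segments whose genetic length is at least min_cm."""
    return [(s, e) for s, e in segments if to_cm(e) - to_cm(s) >= min_cm]


# One 100 kb segment in the cold spot, one 50 kb segment in the hot spot.
SEGMENTS = [(100_000, 200_000), (220_000, 270_000)]

kept_uniform = keep_segments(SEGMENTS, uniform_cm, min_cm=1.0)
kept_map = keep_segments(SEGMENTS, bp_to_cm, min_cm=1.0)
print("uniform rate keeps:", kept_uniform)
print("rate map keeps:   ", kept_map)
```

With these toy numbers, the uniform rate keeps the physically long cold-spot segment (about 1.2 cM by the average rate, but only 0.1 cM on the map) and drops the hot-spot segment (0.6 cM by the average rate, 1.5 cM on the map). The map-based filter does the opposite, which matches the direction of the false positive and false negative effects described in the Results.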

Conclusion: hmmibd-rs builds upon, accelerates, and enhances hmmIBD for efficient and accurate IBD detection, serving as a crucial tool for advancing large-scale malaria genomic surveillance.
