hmmibd-rs: An enhanced hmmIBD implementation for parallelizable identity-by-descent detection from large-scale Plasmodium genomic data.

Research Square. Pub Date: 2025-07-02. DOI: 10.21203/rs.3.rs-7004070/v1

Bing Guo, Stephen F Schaffner, Aimee R Taylor, Timothy D O'Connor, Shannon Takala-Harrison



Abstract

Background: Identity-by-descent (IBD), which describes recent genetic co-ancestry between pairs of genomes, is a fundamental concept in population genomics. It has been used to estimate genetic relatedness, detect signals of selection, and understand population demography. The IBD detection method hmmIBD infers IBD segments between haploid genomes, such as those of Plasmodium falciparum, with high accuracy and is widely used in malaria genomic surveillance. However, the current implementation of hmmIBD is single-threaded and therefore does not use the full capacity of multi-processor computers, which makes it difficult to apply to large data sets. It also does not accommodate non-uniform recombination rates across the genome.

Methods: We developed an enhanced implementation of hmmIBD in the Rust programming language, named hmmibd-rs, which leverages multi-threaded computing to parallelize IBD inference over genome pairs and which supports optional, user-defined recombination rate maps for more accurate IBD detection and filtration from genomes with non-uniform recombination. We further streamlined large-scale IBD detection by incorporating auxiliary built-in functionalities to preprocess input directly from the standard binary variant call format (BCF) and filter IBD output to reduce disk usage.
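The idea of parallelizing inference over genome pairs can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the hmmibd-rs implementation: `pair_score` is a hypothetical stand-in for the per-pair HMM (it only counts matching sites), `all_pairs_parallel` is an invented name, and the sketch uses only the standard library's scoped threads. Genomes are assumed to be non-empty and of equal length.

```rust
use std::thread;

/// Hypothetical stand-in for the per-pair HMM: the fraction of sites at which
/// two haploid genomes carry the same allele. Assumes equal, non-zero lengths.
fn pair_score(a: &[u8], b: &[u8]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}

/// Score every unordered pair (i, j) with i < j, splitting the list of pairs
/// into contiguous chunks that are processed by separate worker threads.
fn all_pairs_parallel(genomes: &[Vec<u8>], n_threads: usize) -> Vec<(usize, usize, f64)> {
    let n = genomes.len();
    let pairs: Vec<(usize, usize)> = (0..n)
        .flat_map(|i| (i + 1..n).map(move |j| (i, j)))
        .collect();
    if pairs.is_empty() {
        return Vec::new();
    }
    let workers = n_threads.max(1);
    // Ceiling division, so that at most `workers` chunks are created.
    let chunk_len = (pairs.len() + workers - 1) / workers;

    let mut results = Vec::with_capacity(pairs.len());
    thread::scope(|s| {
        let handles: Vec<_> = pairs
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|&(i, j)| (i, j, pair_score(&genomes[i], &genomes[j])))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // Joining in spawn order keeps the output in the original pair order.
        for handle in handles {
            results.extend(handle.join().expect("worker thread panicked"));
        }
    });
    results
}

fn main() {
    // Toy data: three haploid genomes typed at four biallelic sites.
    let genomes: Vec<Vec<u8>> = vec![vec![0, 1, 1, 0], vec![0, 1, 0, 0], vec![1, 0, 0, 1]];
    for (i, j, score) in all_pairs_parallel(&genomes, 2) {
        println!("{i}\t{j}\t{score:.2}");
    }
}
```

Because each pair is scored independently and the workers only read shared data, no locking is needed. That independence is what makes pairwise IBD inference a natural fit for multi-threading.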

Results: Our new implementation reduces IBD detection time nearly linearly as the number of CPU threads increases; using 128 threads shortens IBD detection time from 5.2 days to 1.3 hours for 220 million pairs of simulated Plasmodium falciparum-like chromosomes, an approximately 100x speedup over the single-threaded hmmIBD algorithm. Incorporating non-uniform recombination rates in hmmibd-rs enhances the accuracy of IBD inference by mitigating the overestimation of IBD breakpoints in recombination cold spots and their underestimation in hot spots. It also improves filtering of IBD segments by length, reducing the false positive rate in recombination cold spots and the false negative rate in hot spots. When applied to empirical data sets, hmmibd-rs completes the detection of IBD from MalariaGEN Pf7 (n ≈ 10,000 monoclonal samples) within hours, enabling a single-day IBD analysis pipeline for large genomic data sets.
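The effect of a recombination map on length-based filtering can be illustrated with a small Rust sketch. Everything in it is an assumption made for illustration: the `GeneticMap` type, the toy map values, the uniform rate of 2e-5 cM per bp, and the 1 cM threshold are all hypothetical and are not taken from hmmibd-rs or from the paper. The sketch converts a segment's physical coordinates to genetic length in two ways, once with a single genome-wide rate and once by linear interpolation on a map, and compares the two against the same threshold.

```rust
/// Hypothetical recombination map: (physical position in bp, cumulative
/// genetic position in cM), sorted by physical position, at least two points.
struct GeneticMap {
    points: Vec<(f64, f64)>,
}

impl GeneticMap {
    /// Linearly interpolate the cM position of `bp`, clamping outside the map.
    fn cm_at(&self, bp: f64) -> f64 {
        let pts = &self.points;
        let first = pts[0];
        let last = pts[pts.len() - 1];
        if bp <= first.0 {
            return first.1;
        }
        if bp >= last.0 {
            return last.1;
        }
        // Index of the first map point lying strictly to the right of `bp`.
        let hi = pts.partition_point(|p| p.0 <= bp);
        let (x0, y0) = pts[hi - 1];
        let (x1, y1) = pts[hi];
        y0 + (y1 - y0) * (bp - x0) / (x1 - x0)
    }

    /// Genetic length (cM) of a segment according to the map.
    fn segment_cm(&self, start_bp: f64, end_bp: f64) -> f64 {
        self.cm_at(end_bp) - self.cm_at(start_bp)
    }
}

/// Genetic length (cM) of a segment under a single genome-wide rate.
fn uniform_cm(start_bp: f64, end_bp: f64, cm_per_bp: f64) -> f64 {
    (end_bp - start_bp) * cm_per_bp
}

fn main() {
    // Toy map: 6 cM over 300 kb in total (2e-5 cM/bp on average), with a cold
    // spot between 100 kb and 200 kb and a hot spot between 200 kb and 300 kb.
    let map = GeneticMap {
        points: vec![(0.0, 0.0), (100_000.0, 1.0), (200_000.0, 1.2), (300_000.0, 6.0)],
    };
    let min_cm = 1.0; // illustrative length threshold
    let segments = [(100_000.0, 200_000.0), (200_000.0, 240_000.0)];
    for (start, end) in segments {
        let by_uniform = uniform_cm(start, end, 2e-5);
        let by_map = map.segment_cm(start, end);
        println!(
            "{start}-{end} bp: uniform {by_uniform:.2} cM (keep: {}), map {by_map:.2} cM (keep: {})",
            by_uniform >= min_cm,
            by_map >= min_cm
        );
    }
}
```

In this toy setup the cold-spot segment passes the threshold under the uniform rate but fails it under the map, while the hot-spot segment does the opposite. This is the direction of the false positive and false negative effects described above.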

Conclusion: hmmibd-rs builds upon, accelerates, and enhances hmmIBD for efficient and accurate IBD detection, serving as a crucial tool for advancing large-scale malaria genomic surveillance.

