hmmibd-rs: An enhanced hmmIBD implementation for parallelizable identity-by-descent detection from large-scale Plasmodium genomic data
Bing Guo, Stephen F Schaffner, Aimee R Taylor, Timothy D O'Connor, Shannon Takala-Harrison
Research Square, published 2025-07-02. DOI: https://doi.org/10.21203/rs.3.rs-7004070/v1
Abstract
Background: Identity-by-descent (IBD), which describes recent genetic co-ancestry between pairs of genomes, is a fundamental concept in population genomics. It has been used to estimate genetic relatedness, detect selection signals, and understand population demography. The IBD detection method hmmIBD demonstrates high accuracy in inferring IBD segments between haploid genomes, including those of Plasmodium falciparum, and is widely used in malaria genomic surveillance. However, the current implementation of hmmIBD is single-threaded and therefore does not use the full capacity of multi-processor computers, making it difficult to apply to large data sets; it also does not accommodate non-uniform recombination rates across the genome.
Methods: We developed an enhanced implementation of hmmIBD in the Rust programming language, named hmmibd-rs, which leverages multi-threaded computing to parallelize IBD inference over genome pairs and which supports optional, user-defined recombination rate maps for more accurate IBD detection and filtration from genomes with non-uniform recombination. We further streamlined large-scale IBD detection by incorporating auxiliary built-in functionalities to preprocess input directly from the standard binary variant call format (BCF) and filter IBD output to reduce disk usage.
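The parallelization over genome pairs described above can be illustrated with a minimal sketch. It is not the hmmibd-rs code: it assumes only the Rust standard library (scoped threads), and `pair_similarity`, `all_pairs`, and `score_pairs_parallel` are hypothetical names. The per-pair function is a placeholder that reports the fraction of matching alleles, standing in for the hidden Markov model that the real tool runs on each pair.

```rust
use std::thread;

/// Hypothetical stand-in for the per-pair HMM: the fraction of sites at
/// which two haploid genomes carry the same allele.
pub fn pair_similarity(a: &[u8], b: &[u8]) -> f64 {
    let n = a.len().min(b.len());
    if n == 0 {
        return 0.0;
    }
    let matches = a.iter().zip(b.iter()).filter(|(x, y)| x == y).count();
    matches as f64 / n as f64
}

/// Enumerate all unordered genome pairs (i, j) with i < j.
pub fn all_pairs(n: usize) -> Vec<(usize, usize)> {
    let mut pairs = Vec::with_capacity(n * n.saturating_sub(1) / 2);
    for i in 0..n {
        for j in (i + 1)..n {
            pairs.push((i, j));
        }
    }
    pairs
}

/// Split the pair list into contiguous chunks and score each chunk on its
/// own thread. Pairs are independent, so no synchronization is needed
/// beyond joining the workers. Results come back in pair order.
pub fn score_pairs_parallel(
    genomes: &[Vec<u8>],
    n_threads: usize,
) -> Vec<(usize, usize, f64)> {
    let pairs = all_pairs(genomes.len());
    if pairs.is_empty() {
        return Vec::new();
    }
    let n_threads = n_threads.max(1);
    // Ceiling division so that at most `n_threads` chunks are created.
    let chunk_len = (pairs.len() + n_threads - 1) / n_threads;

    thread::scope(|s| {
        // Spawn one worker per chunk; scoped threads may borrow `genomes`
        // and `pairs` because they are joined before the scope ends.
        let handles: Vec<_> = pairs
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|&(i, j)| (i, j, pair_similarity(&genomes[i], &genomes[j])))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Join in spawn order to keep the output deterministic.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Calling `score_pairs_parallel(&genomes, 128)` on a vector of allele arrays returns one `(i, j, score)` tuple per pair. Because the number of pairs grows quadratically with the number of genomes while each pair is processed independently, this kind of workload is a natural fit for the near-linear thread scaling reported in the Results.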
Results: With our new implementation, IBD detection time decreases nearly linearly as the number of CPU threads increases; using 128 threads shortens IBD detection time from 5.2 days to 1.3 hours for 220 million pairs of simulated Plasmodium falciparum-like chromosomes, a speedup of approximately 100x over the single-threaded hmmIBD algorithm. Incorporating non-uniform recombination rates in hmmibd-rs enhances the accuracy of IBD inference by mitigating the overestimation of IBD breakpoints in recombination cold spots and their underestimation in hot spots. It also improves IBD segment length filtration, reducing the false positive rate in recombination cold spots and the false negative rate in hot spots. When applied to empirical data sets, hmmibd-rs completes the detection of IBD from MalariaGEN Pf7 (n ≈ 10,000 monoclonal samples) within hours, enabling a single-day IBD analysis pipeline for large genomic data sets.
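To make the effect of a recombination rate map on segment length filtration concrete, the following sketch converts physical positions to genetic positions by linear interpolation and filters a segment by its length in centimorgans. It is an illustration under stated assumptions, not the hmmibd-rs implementation: the `MapPoint` type, the function names, and the idea of a piecewise-linear map given as (bp, cM) points are all hypothetical choices made for this example.

```rust
/// One point of a hypothetical recombination map: physical position (bp)
/// and cumulative genetic position (cM). Points must be sorted by `bp`
/// with strictly increasing positions.
#[derive(Clone, Copy, Debug)]
pub struct MapPoint {
    pub bp: f64,
    pub cm: f64,
}

/// Linearly interpolate the genetic position (cM) of a physical position.
/// Positions outside the map are clamped to the first or last point.
pub fn bp_to_cm(map: &[MapPoint], bp: f64) -> f64 {
    assert!(!map.is_empty(), "map must contain at least one point");
    if bp <= map[0].bp {
        return map[0].cm;
    }
    let last = map[map.len() - 1];
    if bp >= last.bp {
        return last.cm;
    }
    // Index of the first map point located strictly after `bp`.
    let hi = map.partition_point(|p| p.bp <= bp);
    let (a, b) = (map[hi - 1], map[hi]);
    a.cm + (b.cm - a.cm) * (bp - a.bp) / (b.bp - a.bp)
}

/// Genetic length (cM) of a segment given by its physical endpoints.
pub fn segment_cm(map: &[MapPoint], start_bp: f64, end_bp: f64) -> f64 {
    bp_to_cm(map, end_bp) - bp_to_cm(map, start_bp)
}

/// Keep a segment only if its genetic length reaches `min_cm`.
pub fn passes_length_filter(
    map: &[MapPoint],
    start_bp: f64,
    end_bp: f64,
    min_cm: f64,
) -> bool {
    segment_cm(map, start_bp, end_bp) >= min_cm
}
```

As a made-up example, take a map with points (0 bp, 0 cM), (100,000 bp, 0.5 cM), and (200,000 bp, 10.5 cM), so the first 100 kb is a cold spot and the second 100 kb is a hot spot. Under a uniform rate, each 100 kb segment would be assigned 5.25 cM and both would pass a 2 cM threshold. With the map, the cold-spot segment measures 0.5 cM and is filtered out, while the hot-spot segment measures 10 cM and is kept, which is the direction of the filtering improvement described above.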
Conclusion: hmmibd-rs builds upon, accelerates, and enhances hmmIBD for efficient and accurate IBD detection, serving as a crucial tool for advancing large-scale malaria genomic surveillance.