b-move: faster lossless approximate pattern matching in a run-length compressed index.

IF 1.7 4区生物学 Q4 BIOCHEMICAL RESEARCH METHODS

Algorithms for Molecular Biology Pub Date : 2025-08-12 DOI:10.1186/s13015-025-00281-x

Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, Jan Fostier

{"title":"b-move: faster lossless approximate pattern matching in a run-length compressed index.","authors":"Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, Jan Fostier","doi":"10.1186/s13015-025-00281-x","DOIUrl":null,"url":null,"abstract":"Background: Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.Results: We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs <math><mi>ϕ</mi></math> and <math><msup><mi>ϕ</mi> <mrow><mo>-</mo> <mn>1</mn></mrow> </msup> </math> operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index, for example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop.Conclusions: b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"15"},"PeriodicalIF":1.7000,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12345024/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Algorithms for Molecular Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13015-025-00281-x","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.

Results: We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs $ϕ$ and $ϕ^{- 1}$ operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index, for example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop.

Conclusions: b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.

查看原文本刊更多论文

B-move：在运行长度压缩索引中更快的无损近似模式匹配。

背景：由于高质量基因组序列的可用性越来越高，泛基因组在许多生物信息学管道中逐渐取代单一共识的参考基因组，以更好地捕获遗传多样性。使用传统生物信息学工具的FM-index在如此大的基因组集合中面临内存限制。最近在运行长度压缩索引方面的进展，如Gagie等人的r-index和Nishimoto和Tabei的move结构，缓解了内存限制，但主要集中在mems查找的向后搜索上。Arakawa等人的br-index在运行长度压缩空间中使用双向搜索启动完整的近似模式匹配，但由于复杂的内存访问模式，计算开销很大。结果：我们引入了b-move，一种新的move结构的双向扩展，在运行长度压缩空间中实现快速，缓存高效，无损的近似模式匹配。它实现双向字符扩展的速度比br索引快7倍，缩小了与基于fm索引的替代品的性能差距。对于定位事件，b-move执行ϕ和ϕ - 1操作比br-index快7倍。同时，它保持了br-index的有利的存储特性，例如，NCBI的RefSeq集合中所有可用的完整大肠杆菌基因组都可以编译成一个b-move索引，适合一台典型笔记本电脑的RAM。结论：b-move对泛基因组的索引和查询具有实用性和可扩展性。我们提供了b-move的c++实现，支持高效的无损近似模式匹配，包括定位功能，可在AGPL-3.0许可下在https://github.com/biointec/b-move获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Algorithms for Molecular Biology 生物-生化研究方法

CiteScore

2.40

自引率

10.00%

发文量

审稿时长

>12 weeks

期刊介绍： Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.