Effloc:一种基于fm指数的大规模生物模式的有效定位算法。

IF 1.4 4区 生物学 Q4 BIOCHEMICAL RESEARCH METHODS
Li-Lu Guo
{"title":"Effloc:一种基于fm指数的大规模生物模式的有效定位算法。","authors":"Li-Lu Guo","doi":"10.1089/cmb.2024.0925","DOIUrl":null,"url":null,"abstract":"<p><p>Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, full-text minute-space index has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and retrieving time independent of genome size. However, its locating time is limited by the number of occurrences of the biological pattern in the genome, and it is not efficient enough when dealing with mass-occurrence biological patterns. To solve this problem, we propose an efficient locating algorithm for mass-occurrence biological patterns in genomic sequence, namely Effloc. It is developed on two optimization techniques. One is that rankings with the same Burrows-Wheeler Transform character are organized into a group and calculated together, thereby reducing the number of last-to-first column (<i>LF</i>) mapping operations required to jump forward to find suffix array (SA) sampling points; the other is to design a specific structure to record the jump status, thus avoiding the redundant <i>LF</i> mapping operations that exist in the process of finding SA sampling points for those adjacent patterns that share the same sampling point. Compared with the existing algorithm, Effloc can significantly reduce the number of time-consuming <i>LF</i> mapping operations in mass-occurrence pattern locating. Ablation experiments verified our algorithm's effectiveness, exhibiting faster locating speed compared with five state-of-the-art competing algorithms. The source code and data are released at https://github.com/Lilu-guo/Effloc.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2025-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Effloc: An Efficient Locating Algorithm for Mass-Occurrence Biological Patterns with FM-Index.\",\"authors\":\"Li-Lu Guo\",\"doi\":\"10.1089/cmb.2024.0925\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, full-text minute-space index has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and retrieving time independent of genome size. However, its locating time is limited by the number of occurrences of the biological pattern in the genome, and it is not efficient enough when dealing with mass-occurrence biological patterns. To solve this problem, we propose an efficient locating algorithm for mass-occurrence biological patterns in genomic sequence, namely Effloc. It is developed on two optimization techniques. One is that rankings with the same Burrows-Wheeler Transform character are organized into a group and calculated together, thereby reducing the number of last-to-first column (<i>LF</i>) mapping operations required to jump forward to find suffix array (SA) sampling points; the other is to design a specific structure to record the jump status, thus avoiding the redundant <i>LF</i> mapping operations that exist in the process of finding SA sampling points for those adjacent patterns that share the same sampling point. Compared with the existing algorithm, Effloc can significantly reduce the number of time-consuming <i>LF</i> mapping operations in mass-occurrence pattern locating. Ablation experiments verified our algorithm's effectiveness, exhibiting faster locating speed compared with five state-of-the-art competing algorithms. The source code and data are released at https://github.com/Lilu-guo/Effloc.</p>\",\"PeriodicalId\":15526,\"journal\":{\"name\":\"Journal of Computational Biology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2025-05-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Computational Biology\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1089/cmb.2024.0925\",\"RegionNum\":4,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computational Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1089/cmb.2024.0925","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

摘要

模式定位是各种生物序列分析任务的关键步骤。全文分钟空间索引是一种压缩全文索引技术,用于超长基因组的生物模式定位,具有内存占用少、检索时间与基因组大小无关的特点。然而,它的定位时间受到该生物模式在基因组中出现次数的限制,在处理大量出现的生物模式时效率不够高。为了解决这一问题,我们提出了一种高效的基因组序列中大量出现的生物模式定位算法——Effloc。它是在两种优化技术的基础上发展起来的。一是将具有相同Burrows-Wheeler变换特征的排名组织成一组并一起计算,从而减少了前跳查找后缀数组(SA)采样点所需的从最后到第一列(LF)映射操作的次数;二是设计一个特定的结构来记录跳转状态,从而避免了为共享同一采样点的相邻模式寻找SA采样点过程中存在的冗余LF映射操作。与现有算法相比,Effloc可以显著减少大量发生模式定位中耗时的LF映射操作。消融实验验证了该算法的有效性,与五种最先进的竞争算法相比,显示出更快的定位速度。源代码和数据发布在https://github.com/Lilu-guo/Effloc。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Effloc: An Efficient Locating Algorithm for Mass-Occurrence Biological Patterns with FM-Index.

Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, full-text minute-space index has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and retrieving time independent of genome size. However, its locating time is limited by the number of occurrences of the biological pattern in the genome, and it is not efficient enough when dealing with mass-occurrence biological patterns. To solve this problem, we propose an efficient locating algorithm for mass-occurrence biological patterns in genomic sequence, namely Effloc. It is developed on two optimization techniques. One is that rankings with the same Burrows-Wheeler Transform character are organized into a group and calculated together, thereby reducing the number of last-to-first column (LF) mapping operations required to jump forward to find suffix array (SA) sampling points; the other is to design a specific structure to record the jump status, thus avoiding the redundant LF mapping operations that exist in the process of finding SA sampling points for those adjacent patterns that share the same sampling point. Compared with the existing algorithm, Effloc can significantly reduce the number of time-consuming LF mapping operations in mass-occurrence pattern locating. Ablation experiments verified our algorithm's effectiveness, exhibiting faster locating speed compared with five state-of-the-art competing algorithms. The source code and data are released at https://github.com/Lilu-guo/Effloc.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Journal of Computational Biology
Journal of Computational Biology 生物-计算机:跨学科应用
CiteScore
3.60
自引率
5.90%
发文量
113
审稿时长
6-12 weeks
期刊介绍: Journal of Computational Biology is the leading peer-reviewed journal in computational biology and bioinformatics, publishing in-depth statistical, mathematical, and computational analysis of methods, as well as their practical impact. Available only online, this is an essential journal for scientists and students who want to keep abreast of developments in bioinformatics. Journal of Computational Biology coverage includes: -Genomics -Mathematical modeling and simulation -Distributed and parallel biological computing -Designing biological databases -Pattern matching and pattern detection -Linking disparate databases and data -New tools for computational biology -Relational and object-oriented database technology for bioinformatics -Biological expert system design and use -Reasoning by analogy, hypothesis formation, and testing by machine -Management of biological databases
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信