Non-overlapping indexing in BWT-runs bounded space

IF 1; Zone 4 (Computer Science); JCR Q3 (COMPUTER SCIENCE, THEORY & METHODS)
Daniel Gibney, Paul MacNichol, Sharma V. Thankachan
Theoretical Computer Science, volume 1056, Article 115512. Published 2025-08-12. Journal article.
DOI: 10.1016/j.tcs.2025.115512
https://www.sciencedirect.com/science/article/pii/S0304397525004505

Abstract

We revisit the non-overlapping indexing problem, seeking an efficient repetition-aware solution. The problem is to index a text $T[1..n]$ such that, whenever a pattern $P[1..p]$ arrives as a query, we can report the largest set of non-overlapping occurrences of $P$ in $T$. A previous index by Cohen and Porat [ISAAC 2009] takes linear space and achieves optimal $O(p + occ_{no})$ query time, where $occ_{no}$ denotes the output size. We present an index of size $O(r)$, where $r$ denotes the number of runs in the Burrows-Wheeler Transform (BWT) of $T$. The parameter $r$ is significantly smaller than $n$ for highly repetitive texts. The query time of our index is $O(p \log\log_w \sigma + \mathrm{sort}(occ_{no}))$, where $\sigma$ denotes the alphabet size, $w$ denotes the machine word size in bits, and $\mathrm{sort}(x)$ denotes the time for sorting $x$ integers within the range $[1, n]$. We also study the counting version of this problem.
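
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the index described in the abstract and uses neither the BWT nor $O(r)$ space; it only illustrates what "the largest set of non-overlapping occurrences" means. The function names `all_occurrences` and `max_non_overlapping` are hypothetical and chosen for this illustration. The sketch relies on a standard fact: because all occurrences of a pattern have the same length $p$, scanning the occurrences from left to right and keeping each one that starts after the previously kept one ends yields a maximum-size set.

```python
def all_occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based starting positions of every occurrence of
    pattern in text, overlapping ones included, in increasing order."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)              # convert to 1-based, as in T[1..n]
        start = text.find(pattern, start + 1)    # advance by one to allow overlaps
    return positions


def max_non_overlapping(occurrences: list[int], p: int) -> list[int]:
    """Greedy left-to-right selection of a maximum set of occurrences
    (each of length p) such that no two selected occurrences overlap."""
    chosen = []
    next_free = 0                                # smallest start position still allowed
    for pos in sorted(occurrences):              # positions may arrive in any order
        if pos >= next_free:
            chosen.append(pos)
            next_free = pos + p                  # the next pick must start after this one ends
    return chosen


if __name__ == "__main__":
    text, pattern = "abababa", "aba"
    occ = all_occurrences(text, pattern)
    print(occ)
    print(max_non_overlapping(occ, len(pattern)))
```

For the text `abababa` and the pattern `aba`, the occurrences start at positions 1, 3, and 5; the occurrence at position 3 overlaps the one at position 1, so the greedy pass keeps positions 1 and 5.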
Source journal
Theoretical Computer Science (Engineering and Technology; Computer Science: Theory & Methods)
CiteScore: 2.60
Self-citation rate: 18.20%
Articles published: 471
Review time: 12.6 months
Journal description: Theoretical Computer Science is mathematical and abstract in spirit, but it derives its motivation from practical and everyday computation. Its aim is to understand the nature of computation and, as a consequence of this understanding, provide more efficient methodologies. All papers introducing or studying mathematical, logic and formal concepts and methods are welcome, provided that their motivation is clearly drawn from the field of computing.