Daniel Gibney, Paul MacNichol, Sharma V. Thankachan
Theoretical Computer Science, volume 1056, Article 115512. Published 2025-08-12. DOI: 10.1016/j.tcs.2025.115512. https://www.sciencedirect.com/science/article/pii/S0304397525004505
Non-overlapping indexing in BWT-runs bounded space
We revisit the non-overlapping indexing problem for an efficient repetition-aware solution. The problem is to index a text $T[1..n]$, such that whenever a pattern $P[1..p]$ comes as a query, we can report the largest set of non-overlapping occurrences of $P$ in $T$. A previous index by Cohen and Porat [ISAAC 2009] takes linear space and optimal $O(p + \mathrm{occ}_{no})$ query time, where $\mathrm{occ}_{no}$ denotes the output size. We present an index of size $O(r)$, where $r$ denotes the number of runs in the Burrows-Wheeler Transform (BWT) of $T$. The parameter $r$ is significantly smaller than $n$ for highly repetitive texts. The query time of our index is $O(p \log\log_w \sigma + \mathrm{sort}(\mathrm{occ}_{no}))$, where $\sigma$ denotes the alphabet size, $w$ denotes the machine word size in bits and $\mathrm{sort}(x)$ denotes the time for sorting $x$ integers within the range $[1, n]$. We also study the counting version of this problem.
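To make the two quantities in the abstract concrete, here is a small brute-force sketch in Python. It is not the index described in the paper: it scans the text directly and builds the BWT from sorted rotations, so it uses far more than O(r) space. The function names and the "$" sentinel are assumptions chosen for illustration. The sketch relies on the standard fact that, since all occurrences have the same length, picking occurrences greedily from left to right yields a largest non-overlapping set.

```python
def find_occurrences(text, pattern):
    """Return all 0-based start positions of pattern in text, overlaps included."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def max_non_overlapping(text, pattern):
    """Greedy left-to-right choice of a largest set of non-overlapping occurrences."""
    if not pattern:
        return []
    chosen = []
    next_free = 0  # first text position not covered by a chosen occurrence
    for pos in find_occurrences(text, pattern):
        if pos >= next_free:
            chosen.append(pos)
            next_free = pos + len(pattern)
    return chosen


def bwt_runs(text, sentinel="$"):
    """Return (BWT string, number of runs r), built naively from sorted rotations.

    Assumes the sentinel does not occur in text and sorts before every
    character of text (hypothetical convention for this sketch).
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rotation[-1] for rotation in rotations)
    runs = sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])
    return bwt, runs


if __name__ == "__main__":
    print(max_non_overlapping("abababa", "aba"))
    print(bwt_runs("banana"))
```

In "abababa" the pattern "aba" occurs at positions 0, 2 and 4 (0-based), but only positions 0 and 4 can be reported together. The number of occurrences reported corresponds to the output size in the abstract, and the run count returned by bwt_runs corresponds to the parameter r.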
About the journal:
Theoretical Computer Science is mathematical and abstract in spirit, but it derives its motivation from practical and everyday computation. Its aim is to understand the nature of computation and, as a consequence of this understanding, provide more efficient methodologies. All papers introducing or studying mathematical, logic and formal concepts and methods are welcome, provided that their motivation is clearly drawn from the field of computing.