Simple Runs-Bounded FM-Index Designs Are Fast

Symposium on Experimental Algorithms (SEA 2023), LIPIcs, pp. 7:1-7:16. Pub Date: 2023-01-01. DOI: 10.4230/LIPIcs.SEA.2023.7

Diego Díaz-Domínguez, Saska Dönges, S. Puglisi, Leena Salmela

Abstract

Given a string X of length n on alphabet σ, the FM-index data structure allows counting all occurrences of a pattern P of length m in O(m) time via an algorithm called backward search. An important difficulty when searching with an FM-index is to support queries on L, the Burrows-Wheeler transform (BWT) of X, while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank_c(L, i) (i.e., count the number of times c occurs in the prefix L[1..i]) by first determining the block i/b in which i falls and scanning the block to the appropriate position, counting occurrences of c along the way. This partial answer to rank_c(L, i) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.
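The two block layouts and the backward-search loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `bwt`, `scan_runs`, `FixedBlockRank`, `RunBlockRank` and `backward_count` are hypothetical, plain Python lists and dictionaries stand in for the compact encodings an engineered index would use, and a binary search with `bisect_right` stands in for whatever predecessor structure the paper employs. Positions are 0-based here, so `rank(c, i)` counts occurrences of `c` in `L[0:i]`, which matches the abstract's 1-based prefix L[1..i].

```python
from bisect import bisect_right
from collections import Counter
from itertools import groupby


def bwt(text):
    """Naive BWT via sorted rotations (assumes text ends with a unique smallest sentinel)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def rle(s):
    """Run-length encode s as a list of (symbol, run_length) pairs."""
    return [(sym, len(list(group))) for sym, group in groupby(s)]


def scan_runs(runs, c, offset):
    """Count occurrences of c among the first `offset` symbols encoded by `runs`."""
    total = 0
    for sym, length in runs:
        if offset <= 0:
            break
        take = min(length, offset)
        if sym == c:
            total += take
        offset -= take
    return total


class FixedBlockRank:
    """First design: blocks of b symbols, each run-length encoded on its own."""

    def __init__(self, L, b):
        self.b = b
        self.blocks = []  # (symbol counts before the block, runs inside the block)
        seen = Counter()
        for start in range(0, len(L), b):
            chunk = L[start:start + b]
            self.blocks.append((dict(seen), rle(chunk)))
            seen.update(chunk)

    def rank(self, c, i):
        k = min(i // self.b, len(self.blocks) - 1)  # block containing position i
        before, runs = self.blocks[k]
        return before.get(c, 0) + scan_runs(runs, c, i - k * self.b)


class RunBlockRank:
    """Second design: run-length encode L once, then group r runs per block."""

    def __init__(self, L, r):
        runs = rle(L)
        self.starts = []  # position in L where each block begins
        self.blocks = []
        seen, pos = Counter(), 0
        for j in range(0, len(runs), r):
            chunk = runs[j:j + r]
            self.starts.append(pos)
            self.blocks.append((dict(seen), chunk))
            for sym, length in chunk:
                seen[sym] += length
                pos += length

    def rank(self, c, i):
        k = bisect_right(self.starts, i) - 1  # predecessor query over block starts
        before, runs = self.blocks[k]
        return before.get(c, 0) + scan_runs(runs, c, i - self.starts[k])


def backward_count(pattern, L, index):
    """Count occurrences of pattern with backward search over any rank structure."""
    C, total = {}, 0  # C[c] = number of symbols in L that are smaller than c
    for sym, freq in sorted(Counter(L).items()):
        C[sym] = total
        total += freq
    lo, hi = 0, len(L)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + index.rank(c, lo)
        hi = C[c] + index.rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    L = bwt("abracadabra$")
    for index in (FixedBlockRank(L, 4), RunBlockRank(L, 3)):
        print(type(index).__name__, backward_count("abra", L, index))
```

Both classes expose the same `rank` method, so `backward_count` runs unchanged over either one. The only difference is how the block containing position `i` is found: a division by `b` in the first design, and a search over the block starting positions in the second.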
