Diego Díaz-Domínguez, Saska Dönges, S. Puglisi, Leena Salmela
{"title":"Simple Runs-Bounded FM-Index Designs Are Fast","authors":"Diego Díaz-Domínguez, Saska Dönges, S. Puglisi, Leena Salmela","doi":"10.4230/LIPIcs.SEA.2023.7","DOIUrl":null,"url":null,"abstract":"Given a string X of length n on alphabet σ , the FM-index data structure allows counting all occurrences of a pattern P of length m in O ( m ) time via an algorithm called backward search . An important difficulty when searching with an FM-index is to support queries on L , the Burrows-Wheeler transform of X , while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly-repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank c ( L, i ) (i.e., count the number of times c occurs in the prefix L [1 ..i ]) by first determining the block i/b in which i falls and scanning the block to the appropriate position counting occurrences of c along the way. This partial answer to rank c ( L, i ) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.","PeriodicalId":9448,"journal":{"name":"Bulletin of the Society of Sea Water Science, Japan","volume":"1 1","pages":"7:1-7:16"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bulletin of the Society of Sea Water Science, Japan","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4230/LIPIcs.SEA.2023.7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Given a string X of length n on alphabet σ , the FM-index data structure allows counting all occurrences of a pattern P of length m in O ( m ) time via an algorithm called backward search . An important difficulty when searching with an FM-index is to support queries on L , the Burrows-Wheeler transform of X , while L is in compressed form. This problem has been the subject of intense research for 25 years now. Run-length encoding of L is an effective way to reduce index size, in particular when the data being indexed is highly-repetitive, which is the case in many types of modern data, including those arising from versioned document collections and in pangenomics. This paper takes a back-to-basics look at supporting backward search in FM-indexes, exploring and engineering two simple designs. The first divides the BWT string into blocks containing b symbols each and then run-length compresses each block separately, possibly introducing new runs (compared to applying run-length encoding once, to the whole string). Each block stores counts of each symbol that occurs before the block. This method supports the operation rank c ( L, i ) (i.e., count the number of times c occurs in the prefix L [1 ..i ]) by first determining the block i/b in which i falls and scanning the block to the appropriate position counting occurrences of c along the way. This partial answer to rank c ( L, i ) is then added to the stored count of c symbols before the block to determine the final answer. Our second design has a similar structure, but instead divides the run-length-encoded version of L into blocks containing an equal number of runs. The trick then is to determine the block in which a query falls, which is achieved via a predecessor query over the block starting positions. We show via extensive experiments on a wide range of repetitive text collections that these FM-indexes are not only easy to implement, but also fast and space efficient in practice.