Practical Succinct Text Indexes in External Memory
Hongwei Huo, Xiaoyang Chen, Yuhao Zhao, Xiaojin Zhu, J. Vitter
2018 Data Compression Conference, published 2018-03-27
DOI: 10.1109/DCC.2018.00030 (https://doi.org/10.1109/DCC.2018.00030)
Citations: 1
Abstract
Chien et al. [1, 2] introduced the geometric Burrows-Wheeler transform (GBWT) as the first succinct text index for I/O-efficient pattern matching in external memory; it operates by transforming a text T into a point set S in the two-dimensional plane. In this paper we introduce a practical succinct external-memory text index, called mKD-GBWT. We partition S into σ² subregions by partitioning the x-axis into σ intervals using the suffix ranges of the characters of T and partitioning the y-axis into σ intervals using the characters of T, where σ is the alphabet size of T. In this way, we can represent each point using fewer bits and perform a query in a reduced region, which improves the space usage and I/O cost of GBWT in practice. In addition, we plug a crit-bit tree into each node of the string B-trees to represent the variable-length strings stored there. Experimental results show that mKD-GBWT provides a significant improvement in space usage compared with state-of-the-art indexing techniques. The source code is available online [3].
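The grid partition described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the authors' implementation. It assumes one point per suffix, with the x-coordinate taken from the suffix-array rank and the y-interval chosen by the single preceding character. The actual GBWT point construction (which works on blocks of characters) is not given in the abstract, and the names `build_cells` and `bits_for` are hypothetical. The sketch only shows why coordinates stored relative to a character's suffix range need fewer bits than global ranks.

```python
from collections import defaultdict


def build_cells(text):
    """Toy grid partition: one point per suffix that has a preceding character.

    Assumption (not from the paper): a point's column is the first character
    of the suffix and its row is the character just before the suffix.
    """
    n = len(text)
    # Naive suffix array; adequate for a demo, not for large texts.
    sa = sorted(range(n), key=lambda i: text[i:])

    # Suffix range of each character: half-open interval [lo, hi) of ranks
    # whose suffixes start with that character.
    ranges = {}
    for rank, pos in enumerate(sa):
        c = text[pos]
        lo = ranges[c][0] if c in ranges else rank
        ranges[c] = (lo, rank + 1)

    # Each cell stores x-coordinates relative to the start of its column.
    cells = defaultdict(list)
    for rank, pos in enumerate(sa):
        if pos == 0:
            continue  # the full text has no preceding character
        col, row = text[pos], text[pos - 1]
        cells[(col, row)].append(rank - ranges[col][0])
    return ranges, dict(cells)


def bits_for(count):
    """Bits needed to address `count` distinct values (at least 1)."""
    return max(1, (count - 1).bit_length())


if __name__ == "__main__":
    ranges, cells = build_cells("banana")
    print(ranges)
    print(cells)
    # A global rank needs bits_for(6) bits; an offset inside the column
    # for 'a' needs only bits_for(3) bits.
    print(bits_for(6), bits_for(3))
```

In this toy model, a query whose pattern starts with a given character only has to inspect the cells in that character's column, which is the "reduced region" effect the abstract mentions.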