Authors: José Fuentes-Sepúlveda, G. Navarro, Yakov Nekrich
Venue: 2019 Data Compression Conference (DCC)
Published: 2019-03-26
DOI: 10.1109/DCC.2019.00021 (https://doi.org/10.1109/DCC.2019.00021)
Citations: 5
Space-Efficient Computation of the Burrows-Wheeler Transform
The Burrows-Wheeler Transform (BWT) has become an essential tool for compressed text indexing. Computing it efficiently and within little space is essential for the practicality of the indexes that build on it. A recent algorithm (Munro, Navarro & Nekrich, SODA 2017) computes the BWT in O(n) time using O(n lg σ) bits of space for a text of length n over an alphabet of size σ. The result is of theoretical nature and its practicality is far from obvious. In this paper we engineer their solution and show that, while a basic implementation is slow in practice, the algorithm is amenable to parallelization. For a wide range of alphabet sizes, our resulting implementation outperforms all the compact constructions in the space/time tradeoff map. On the smallest alphabets we are outperformed in time, but nevertheless achieve the least space within reasonable time. For example, in DNA sequences, the most widely used application of BWTs, our construction uses 4.84 bits per base and builds the BWT at a rate of 2.13 megabases per second, whereas the closest previous alternative uses around 7.09 bits per base and runs at 4.17 megabases per second.
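For readers unfamiliar with the transform itself, the following is a minimal sketch of what a BWT construction outputs. It is not the algorithm of Munro, Navarro & Nekrich, nor the engineered implementation described in the abstract: it is the textbook definition via suffix sorting, which needs far more than O(n lg σ) bits and far more than O(n) time. The function name `bwt_naive` and the `$` end-of-text sentinel are assumptions made for illustration.

```python
def bwt_naive(text: str, sentinel: str = "$") -> str:
    """Return the BWT of text + sentinel by explicitly sorting all suffixes.

    Illustrative only: sorting suffixes as Python strings costs roughly
    O(n^2 log n) time in the worst case, and the suffix array alone takes
    O(n lg n) bits, which is exactly the overhead that compact
    constructions try to avoid.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Suffix array: starting positions of the suffixes in lexicographic order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT lists, for each suffix in sorted order, the character that
    # precedes it in the text (s[-1] wraps around for the suffix at position 0).
    return "".join(s[i - 1] for i in sa)


if __name__ == "__main__":
    print(bwt_naive("banana"))  # prints "annb$aa"
```

The output is a permutation of the input that tends to group equal characters together, which is what makes it useful for compressed text indexes.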