A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation

Proceedings DCC '98 Data Compression Conference (Cat. No.98TB100225) | Pub Date: 1998-03-30 | DOI: 10.1109/DCC.1998.672139

K. Sadakane


Citations: 74

Abstract

We propose a fast and memory-efficient algorithm for sorting the suffixes of a text in lexicographic order. Sorting suffixes matters because the resulting array of suffix indexes, called a suffix array, is a memory-efficient alternative to the suffix tree. Suffix sorting is also used for the Burrows-Wheeler transformation (see Technical Report 124, Digital SRC Research Report, 1994) in block-sorting text compression, so fast sorting algorithms are desirable. We compare, in terms of speed and required memory, the suffix-array construction algorithms of Bentley-Sedgewick (see Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms, p.360-9, 1997), Andersson-Nilsson (see 35th Symp. on Foundations of Computer Science, p.714-21, 1994) and Karp-Miller-Rosenberg (1972) with the suffix-tree construction algorithm of Larsson (see Data Compression Conference, p.190-9, 1996), and we propose a new algorithm that combines them to be both fast and memory efficient. We also define a measure of the difficulty of sorting suffixes: average match length. Our algorithm is effective when the average match length of a text is large, especially for large databases.
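The objects named in the abstract can be illustrated with a small sketch. The code below is not the paper's combined algorithm: it builds a suffix array by plain prefix doubling (the idea usually attributed to Karp-Miller-Rosenberg), derives the Burrows-Wheeler transform from it, and computes an average match length. The function names, the `$` sentinel, and the reading of average match length as the mean longest-common-prefix length between lexicographically adjacent suffixes are assumptions made for illustration, since the abstract does not state the definition.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of the suffixes of `text` in lexicographic order.

    Illustrative prefix-doubling sort, not the algorithm proposed in the paper.
    """
    n = len(text)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in text]  # rank by first character
    k = 1
    while True:
        # Compare suffixes by their first 2k characters using the ranks
        # of the first k and the next k characters; -1 means "past the end".
        def key(i: int) -> tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            differs = key(sa[j]) != key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] + int(differs)
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct, so the order is final
            return sa
        k *= 2


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform of `text` plus a terminating sentinel.

    Assumes `sentinel` does not occur in `text` and sorts below every
    character of it.
    """
    s = text + sentinel
    # The character preceding each sorted suffix; s[-1] wraps to the sentinel.
    return "".join(s[i - 1] for i in suffix_array(s))


def average_match_length(text: str) -> float:
    """Mean LCP length of adjacent suffixes in sorted order (assumed definition)."""
    sa = suffix_array(text)
    if len(sa) < 2:
        return 0.0
    total = 0
    for a, b in zip(sa, sa[1:]):
        while a < len(text) and b < len(text) and text[a] == text[b]:
            total += 1
            a += 1
            b += 1
    return total / (len(sa) - 1)


print(suffix_array("banana"))
print(bwt("banana"))
print(average_match_length("banana"))
```

Under this assumed definition, a text with long repeated substrings has a large average match length, because neighbouring suffixes in sorted order then share long prefixes and a comparison-based sort has to inspect many characters to separate them.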
