一种在内存中构造后缀数组的有效方法

6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268) Pub Date : 1999-09-21 DOI:10.1109/SPIRE.1999.796581

Hideo Itoh, Hozumi Tanaka

{"title":"一种在内存中构造后缀数组的有效方法","authors":"Hideo Itoh, Hozumi Tanaka","doi":"10.1109/SPIRE.1999.796581","DOIUrl":null,"url":null,"abstract":"The suffix array is a string-indexing structure and a memory efficient alternative to the suffix tree. It has many advantages for text processing. We propose an efficient algorithm for sorting suffixes. We call this algorithm the two-stage suffix sort. One of our ideas is to exploit the specific relationships between adjacent suffixes. Our algorithm makes it possible to use the suffix array for much larger texts and suggests new areas of application. Our experiments on several text data sets (including 514-MB Japanese newspapers) demonstrate that our algorithm is 4.5 to 6.9 times faster than Quicksort, and 2.5 to 3.6 times faster than K. Sadakane's (1998) algorithm, which is considered to be the fastest algorithm in previous work.","PeriodicalId":131279,"journal":{"name":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1999-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"74","resultStr":"{\"title\":\"An efficient method for in memory construction of suffix arrays\",\"authors\":\"Hideo Itoh, Hozumi Tanaka\",\"doi\":\"10.1109/SPIRE.1999.796581\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The suffix array is a string-indexing structure and a memory efficient alternative to the suffix tree. It has many advantages for text processing. We propose an efficient algorithm for sorting suffixes. We call this algorithm the two-stage suffix sort. One of our ideas is to exploit the specific relationships between adjacent suffixes. Our algorithm makes it possible to use the suffix array for much larger texts and suggests new areas of application. Our experiments on several text data sets (including 514-MB Japanese newspapers) demonstrate that our algorithm is 4.5 to 6.9 times faster than Quicksort, and 2.5 to 3.6 times faster than K. Sadakane's (1998) algorithm, which is considered to be the fastest algorithm in previous work.\",\"PeriodicalId\":131279,\"journal\":{\"name\":\"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1999-09-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"74\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPIRE.1999.796581\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPIRE.1999.796581","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 74

摘要

后缀数组是一个字符串索引结构，是后缀树的内存效率替代方案。它在文本处理方面有许多优点。我们提出了一种高效的后缀排序算法。我们称这种算法为两阶段后缀排序。我们的想法之一是利用相邻后缀之间的特定关系。我们的算法使得对更大的文本使用后缀数组成为可能，并提出了新的应用领域。我们在几个文本数据集(包括514 mb的日本报纸)上的实验表明，我们的算法比Quicksort快4.5到6.9倍，比K. Sadakane(1998)的算法快2.5到3.6倍，该算法被认为是以前工作中最快的算法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

An efficient method for in memory construction of suffix arrays

The suffix array is a string-indexing structure and a memory efficient alternative to the suffix tree. It has many advantages for text processing. We propose an efficient algorithm for sorting suffixes. We call this algorithm the two-stage suffix sort. One of our ideas is to exploit the specific relationships between adjacent suffixes. Our algorithm makes it possible to use the suffix array for much larger texts and suggests new areas of application. Our experiments on several text data sets (including 514-MB Japanese newspapers) demonstrate that our algorithm is 4.5 to 6.9 times faster than Quicksort, and 2.5 to 3.6 times faster than K. Sadakane's (1998) algorithm, which is considered to be the fastest algorithm in previous work.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)

自引率

0.00%

发文量