An efficient method for in memory construction of suffix arrays

Hideo Itoh, Hozumi Tanaka
Published in: 6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268)
Publication date: 1999-09-21
DOI: 10.1109/SPIRE.1999.796581 (https://doi.org/10.1109/SPIRE.1999.796581)
Citations: 74

Abstract

The suffix array is a string-indexing structure and a memory-efficient alternative to the suffix tree. It has many advantages for text processing. We propose an efficient algorithm for sorting suffixes, which we call the two-stage suffix sort. One of our ideas is to exploit the specific relationships between adjacent suffixes. Our algorithm makes it possible to use the suffix array for much larger texts and suggests new areas of application. Our experiments on several text data sets (including 514 MB of Japanese newspaper text) demonstrate that our algorithm is 4.5 to 6.9 times faster than Quicksort, and 2.5 to 3.6 times faster than K. Sadakane's (1998) algorithm, which is considered the fastest algorithm in previous work.
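
For readers unfamiliar with the structure, the following Python sketch may help. It is not the authors' implementation, and the abstract does not give the algorithm's details. `naive_suffix_array` is the plain comparison-sort baseline. `two_stage_sketch` illustrates one way a relationship between adjacent suffixes can be exploited, following how the two-stage idea is commonly described: only suffixes whose first character is not greater than the next one (called "type B" here) are sorted explicitly, and the remaining suffixes are placed by a single left-to-right scan. All function names, the type A/B terminology, and the bucket layout are assumptions made for illustration.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Baseline: sort suffix start positions by comparing whole suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def two_stage_sketch(text: str) -> list[int]:
    """Illustrative two-stage construction (an assumption, not the paper's code)."""
    n = len(text)
    if n == 0:
        return []
    # Type B: first character <= next character. The last suffix is treated
    # as type A, because the end of the text sorts below every character.
    is_b = [i + 1 < n and text[i] <= text[i + 1] for i in range(n)]

    # Bucket boundaries: one contiguous bucket per distinct first character.
    counts: dict[str, int] = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    head: dict[str, int] = {}
    tail: dict[str, int] = {}
    pos = 0
    for ch in sorted(counts):
        head[ch] = pos
        pos += counts[ch]
        tail[ch] = pos  # exclusive end of the bucket

    sa = [-1] * n
    # Stage 1: sort only the type B suffixes. Inside a bucket they all follow
    # the type A suffixes, so they are written at the tail of each bucket.
    b_sorted = sorted((i for i in range(n) if is_b[i]), key=lambda i: text[i:])
    for i in reversed(b_sorted):
        tail[text[i]] -= 1
        sa[tail[text[i]]] = i

    # Stage 2: the last suffix is the smallest in its bucket; seed it, then
    # scan left to right. If suffix j is already in order and j - 1 is type A,
    # suffix j - 1 takes the next free slot at the head of its bucket.
    last = n - 1
    sa[head[text[last]]] = last
    head[text[last]] += 1
    for k in range(n):
        j = sa[k] - 1
        if j >= 0 and not is_b[j]:
            sa[head[text[j]]] = j
            head[text[j]] += 1
    return sa


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of pattern, using binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

In this sketch, stage 1 still uses a generic comparison sort, so it says nothing about the speedups reported above; it only shows that the explicitly sorted set can be a subset of all suffixes, with the rest derived from their right-hand neighbours.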