Generic Non-Recursive Suffix Array Construction

IF 0.9, Zone 3 Computer Science, Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Algorithms Pub Date : 2024-02-08 DOI:10.1145/3641854

Jannik Olbrich, Enno Ohlebusch, Thomas Büchler

Citations: 0

Abstract

The suffix array is arguably one of the most important data structures in sequence analysis, and consequently there is a multitude of suffix sorting algorithms. However, to date the GSACA algorithm introduced in 2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort to improve GSACA’s non-competitive real-world performance. There is a super-linear algorithm, DSH, which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. The purpose of this paper is twofold: we analyse the sorting principle used in GSACA and DSH and exploit its properties to give an optimised linear-time algorithm, and we show that it can be used very elegantly to compute both the original extended Burrows-Wheeler transform (\(\mathsf{eBWT}\)) and a bijective version of the Burrows-Wheeler transform (\(\mathsf{BBWT}\)) in linear time. We call the algorithm “generic” since it can be used to compute the regular suffix array and the variants used for the \(\mathsf{BBWT}\) and \(\mathsf{eBWT}\). Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH. Our \(\mathsf{BBWT}\) algorithm is faster than or competitive with all other tested \(\mathsf{BBWT}\) construction implementations on large or repetitive data, and our \(\mathsf{eBWT}\) algorithm is faster than all other programs on data that is not extremely repetitive.
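For readers unfamiliar with the objects the abstract refers to, the following is a minimal sketch of what a suffix array and the Burrows-Wheeler transform derived from it look like. It is not the paper's linear-time algorithm, nor GSACA, DSH, or DivSufSort: it sorts suffixes naively by direct comparison, which is far slower than linear time. The function names and the `$` sentinel are assumptions made for illustration only.

```python
def naive_suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive comparison sort over suffix slices; illustrative only, not linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """Derive the Burrows-Wheeler transform from a suffix array.

    For each suffix in sorted order, emit the character that precedes it
    (wrapping around to the last character for the suffix starting at 0).
    """
    return "".join(text[i - 1] for i in sa)


# "$" is a sentinel assumed to be smaller than every other character.
text = "banana$"
sa = naive_suffix_array(text)
bwt = bwt_from_suffix_array(text, sa)
print(sa)   # [6, 5, 3, 1, 0, 4, 2]
print(bwt)  # annb$aa
```

The sketch covers only the regular suffix array and the standard transform; the \(\mathsf{eBWT}\) and \(\mathsf{BBWT}\) variants mentioned in the abstract sort different objects and are not shown here.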

Source journal

ACM Transactions on Algorithms (COMPUTER SCIENCE, THEORY & METHODS; MATHEMATICS, APPLIED)

CiteScore

3.30

Self-citation rate

0.00%

Articles published

Review time

6-12 weeks

About the journal: ACM Transactions on Algorithms welcomes submissions of original research of the highest quality dealing with algorithms that are inherently discrete and finite, and having mathematical content in a natural way, either in the objective or in the analysis. Most welcome are new algorithms and data structures, new and improved analyses, and complexity results. Specific areas of computation covered by the journal include combinatorial searches and objects; counting; discrete optimization and approximation; randomization and quantum computation; parallel and distributed computation; algorithms for graphs, geometry, arithmetic, number theory, strings; on-line analysis; cryptography; coding; data compression; learning algorithms; methods of algorithmic analysis; discrete algorithms for application areas such as biology, economics, game theory, communication, computer systems and architecture, hardware design, scientific computing