Generic Non-Recursive Suffix Array Construction
Jannik Olbrich, Enno Ohlebusch, Thomas Büchler
ACM Transactions on Algorithms, published 2024-02-08
DOI: 10.1145/3641854 (https://doi.org/10.1145/3641854)
Citations: 0
Abstract
The suffix array is arguably one of the most important data structures in sequence analysis, and consequently there is a multitude of suffix sorting algorithms. However, to date the GSACA algorithm introduced in 2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort to improve GSACA’s non-competitive real-world performance. There is a super-linear algorithm, DSH, which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. The purpose of this paper is twofold: we analyse the sorting principle used in GSACA and DSH and exploit its properties in order to give an optimised linear-time algorithm, and we show that it can be used very elegantly to compute both the original extended Burrows-Wheeler transform (\(\mathsf{eBWT}\)) and a bijective version of the Burrows-Wheeler transform (\(\mathsf{BBWT}\)) in linear time. We call the algorithm “generic” since it can be used to compute the regular suffix array and the variants used for the \(\mathsf{BBWT}\) and \(\mathsf{eBWT}\). Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH. Our \(\mathsf{BBWT}\) algorithm is faster than or competitive with all other tested \(\mathsf{BBWT}\) construction implementations on large or repetitive data, and our \(\mathsf{eBWT}\) algorithm is faster than all other programs on data that is not extremely repetitive.
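The abstract names the objects being computed without defining them. As a reference point only, the following Python sketch builds a suffix array by plain comparison sorting and a bijective BWT from the commonly used definition via Lyndon factorisation (Duval's algorithm). It is not the paper's algorithm, GSACA, or DSH: it is super-linear and intended only to show what the outputs look like on small inputs. The function names `naive_suffix_array`, `lyndon_factors`, and `naive_bbwt` are hypothetical and chosen for this illustration.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order.

    Reference definition only: sorting whole suffixes is far from linear time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lyndon_factors(text: str) -> list[str]:
    """Lyndon factorisation of `text` using Duval's algorithm."""
    factors = []
    i, n = 0, len(text)
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repeated Lyndon words as far as possible.
        while j < n and text[k] <= text[j]:
            k = i if text[k] < text[j] else k + 1
            j += 1
        # Emit each full repetition of the Lyndon word of length j - k.
        while i <= k:
            factors.append(text[i:i + j - k])
            i += j - k
    return factors


def naive_bbwt(text: str) -> str:
    """Bijective BWT: last characters of all rotations of all Lyndon
    factors, sorted by the order of their infinite repetitions."""
    factors = lyndon_factors(text)
    # Two distinct infinite repetitions differ within len(u) + len(v)
    # characters, so a prefix of twice the longest factor is enough.
    width = 2 * max(map(len, factors), default=0)
    rotations = [f[r:] + f[:r] for f in factors for r in range(len(f))]

    def omega_key(rotation: str) -> str:
        repeats = width // len(rotation) + 1
        return (rotation * repeats)[:width]

    rotations.sort(key=omega_key)
    return "".join(rotation[-1] for rotation in rotations)
```

For the input `"banana"`, `naive_suffix_array` returns `[5, 3, 1, 0, 4, 2]`, the Lyndon factors are `b`, `an`, `an`, `a`, and `naive_bbwt` returns `"annbaa"`.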
About the journal:
ACM Transactions on Algorithms welcomes submissions of original research of the highest quality dealing with algorithms that are inherently discrete and finite, and having mathematical content in a natural way, either in the objective or in the analysis. Most welcome are new algorithms and data structures, new and improved analyses, and complexity results. Specific areas of computation covered by the journal include:
combinatorial searches and objects;
counting;
discrete optimization and approximation;
randomization and quantum computation;
parallel and distributed computation;
algorithms for graphs, geometry, arithmetic, number theory, and strings;
on-line analysis;
cryptography;
coding;
data compression;
learning algorithms;
methods of algorithmic analysis;
discrete algorithms for application areas such as biology, economics, game theory, communication, computer systems and architecture, hardware design, and scientific computing.