用Libsais直接构造稀疏后缀数组。

IF 3.3 3区生物学 Q2 BIOCHEMICAL RESEARCH METHODS

BMC Bioinformatics Pub Date : 2025-10-17 DOI:10.1186/s12859-025-06277-z

Simon Van de Vyver, Tibo Vande Moortele, Peter Dawyndt, Bart Mesuere, Pieter Verschaffelt

{"title":"用Libsais直接构造稀疏后缀数组。","authors":"Simon Van de Vyver, Tibo Vande Moortele, Peter Dawyndt, Bart Mesuere, Pieter Verschaffelt","doi":"10.1186/s12859-025-06277-z","DOIUrl":null,"url":null,"abstract":"Background: Pattern matching is a fundamental challenge in bioinformatics, especially in the fields of genomics, transcriptomics and proteomics. Efficient indexing structures, such as suffix arrays, are critical for searching large datasets. A sparse suffix array (SSA) retains only suffixes at every k-th position in the text, where k is the sparseness factor. While sparse suffix arrays offer significant memory savings compared to full suffix arrays, they typically still require the construction of a full suffix array prior to a sampling step, resulting in substantial memory overhead during the construction phase.Results: We present an alternative method to directly construct the sparse suffix array using a simple, yet powerful text encoding. This encoding reduces the input text length by grouping characters, thereby enabling direct SSA construction by extending the widely used Libsais library. This approach bypasses the need to construct a full suffix array, reducing memory usage and construction time by 50 to 75% when building a sparse suffix array with sparseness factor 3 or 4 for various nucleotide and amino acid datasets. Depending on the alphabet size, similar gains can be achieved for sparseness factors up to 8. For higher sparseness factors, comparable performance improvements can be obtained by constructing the SSA using a suitable divisor of the desired sparseness factor, followed by a subsampling step. The method is particularly effective for applications with small alphabets, such as a nucleotide or amino acid alphabet. An open-source implementation of this method is available on GitHub, enabling easy adoption for large-scale bioinformatics applications.Conclusions: We introduce an efficient method for the construction of sparse suffix arrays for large datasets. Central to this approach is the introduction of a simple text transformation, which then serves as input to Libsais. This method reduces the length of both the input text and the resulting suffix array by a factor of k, which improves execution time and memory usage significantly.","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"252"},"PeriodicalIF":3.3000,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12535041/pdf/","citationCount":"0","resultStr":"{\"title\":\"Direct construction of sparse suffix arrays with Libsais.\",\"authors\":\"Simon Van de Vyver, Tibo Vande Moortele, Peter Dawyndt, Bart Mesuere, Pieter Verschaffelt\",\"doi\":\"10.1186/s12859-025-06277-z\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: Pattern matching is a fundamental challenge in bioinformatics, especially in the fields of genomics, transcriptomics and proteomics. Efficient indexing structures, such as suffix arrays, are critical for searching large datasets. A sparse suffix array (SSA) retains only suffixes at every k-th position in the text, where k is the sparseness factor. While sparse suffix arrays offer significant memory savings compared to full suffix arrays, they typically still require the construction of a full suffix array prior to a sampling step, resulting in substantial memory overhead during the construction phase.Results: We present an alternative method to directly construct the sparse suffix array using a simple, yet powerful text encoding. This encoding reduces the input text length by grouping characters, thereby enabling direct SSA construction by extending the widely used Libsais library. This approach bypasses the need to construct a full suffix array, reducing memory usage and construction time by 50 to 75% when building a sparse suffix array with sparseness factor 3 or 4 for various nucleotide and amino acid datasets. Depending on the alphabet size, similar gains can be achieved for sparseness factors up to 8. For higher sparseness factors, comparable performance improvements can be obtained by constructing the SSA using a suitable divisor of the desired sparseness factor, followed by a subsampling step. The method is particularly effective for applications with small alphabets, such as a nucleotide or amino acid alphabet. An open-source implementation of this method is available on GitHub, enabling easy adoption for large-scale bioinformatics applications.Conclusions: We introduce an efficient method for the construction of sparse suffix arrays for large datasets. Central to this approach is the introduction of a simple text transformation, which then serves as input to Libsais. This method reduces the length of both the input text and the resulting suffix array by a factor of k, which improves execution time and memory usage significantly.\",\"PeriodicalId\":8958,\"journal\":{\"name\":\"BMC Bioinformatics\",\"volume\":\"26 1\",\"pages\":\"252\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12535041/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Bioinformatics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s12859-025-06277-z\",\"RegionNum\":3,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12859-025-06277-z","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

摘要

背景：模式匹配是生物信息学，特别是基因组学、转录组学和蛋白质组学领域的基本挑战。高效的索引结构，如后缀数组，对于搜索大型数据集至关重要。稀疏后缀数组（SSA）只保留文本中每k个位置的后缀，其中k是稀疏因子。虽然与完整后缀数组相比，稀疏后缀数组提供了显著的内存节省，但它们通常仍然需要在采样步骤之前构建完整后缀数组，从而在构建阶段产生大量内存开销。结果：我们提出了一种使用简单但功能强大的文本编码直接构建稀疏后缀数组的替代方法。这种编码通过对字符进行分组来减少输入文本的长度，从而通过扩展广泛使用的Libsais库来实现直接的SSA构造。这种方法绕过了构建完整后缀数组的需要，在为各种核苷酸和氨基酸数据集构建稀疏因子为3或4的稀疏后缀数组时，将内存使用和构建时间减少了50%到75%。根据字母大小的不同，对于高达8的稀疏因子也可以获得类似的增益。对于更高的稀疏因子，可以通过使用所需稀疏因子的合适除数构建SSA，然后进行子采样步骤，从而获得相当的性能改进。该方法对于小字母（如核苷酸或氨基酸字母）的应用特别有效。该方法的开源实现可以在GitHub上获得，可以轻松采用大规模生物信息学应用。结论：本文提出了一种高效的大型数据集稀疏后缀数组构建方法。这种方法的核心是引入一个简单的文本转换，然后作为Libsais的输入。该方法将输入文本和结果后缀数组的长度减少了k倍，从而显著改善了执行时间和内存使用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Direct construction of sparse suffix arrays with Libsais.

Background: Pattern matching is a fundamental challenge in bioinformatics, especially in the fields of genomics, transcriptomics and proteomics. Efficient indexing structures, such as suffix arrays, are critical for searching large datasets. A sparse suffix array (SSA) retains only suffixes at every k-th position in the text, where k is the sparseness factor. While sparse suffix arrays offer significant memory savings compared to full suffix arrays, they typically still require the construction of a full suffix array prior to a sampling step, resulting in substantial memory overhead during the construction phase.

Results: We present an alternative method to directly construct the sparse suffix array using a simple, yet powerful text encoding. This encoding reduces the input text length by grouping characters, thereby enabling direct SSA construction by extending the widely used Libsais library. This approach bypasses the need to construct a full suffix array, reducing memory usage and construction time by 50 to 75% when building a sparse suffix array with sparseness factor 3 or 4 for various nucleotide and amino acid datasets. Depending on the alphabet size, similar gains can be achieved for sparseness factors up to 8. For higher sparseness factors, comparable performance improvements can be obtained by constructing the SSA using a suitable divisor of the desired sparseness factor, followed by a subsampling step. The method is particularly effective for applications with small alphabets, such as a nucleotide or amino acid alphabet. An open-source implementation of this method is available on GitHub, enabling easy adoption for large-scale bioinformatics applications.

Conclusions: We introduce an efficient method for the construction of sparse suffix arrays for large datasets. Central to this approach is the introduction of a simple text transformation, which then serves as input to Libsais. This method reduces the length of both the input text and the resulting suffix array by a factor of k, which improves execution time and memory usage significantly.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BMC Bioinformatics 生物-生化研究方法

CiteScore

5.70

自引率

3.30%

发文量

506

审稿时长

4.3 months

期刊介绍： BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.