External memory BWT and LCP computation for sequence collections with applications.

IF 1.5 · Zone 4 (Biology) · Q4 Biochemical Research Methods

Algorithms for Molecular Biology Pub Date : 2019-03-08 eCollection Date: 2019-01-01 DOI:10.1186/s13015-019-0140-0

Lavinia Egidi, Felipe A Louza, Giovanni Manzini, Guilherme P Telles


Citations: 30

Abstract

Background: Sequencing technologies produce ever larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory and to make the best possible use of the available RAM.
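For readers unfamiliar with the two arrays, the following is a minimal in-RAM sketch of what the BWT and LCP array of a collection contain. It is a naive definition-by-sorting, not the paper's method. It assumes each sequence gets its own terminator, that terminators sort before every letter, and that the terminator of an earlier sequence sorts before that of a later one; the function name `bwt_and_lcp` is illustrative only.

```python
def bwt_and_lcp(seqs):
    """Naive BWT and LCP array of a collection, by sorting all suffixes.

    Assumption: sequence i ends with its own terminator $_i, with
    $_0 < $_1 < ... < every letter. All terminators are printed as '$'.
    """
    rows = []
    for i, s in enumerate(seqs):
        for p in range(len(s) + 1):
            # Letters become (1, ch); the terminator of sequence i becomes (0, i),
            # so terminators sort first and are ordered by sequence index.
            key = tuple((1, ch) for ch in s[p:]) + ((0, i),)
            # The BWT symbol is the character that precedes the suffix.
            prev = s[p - 1] if p > 0 else "$"
            rows.append((key, prev))
    rows.sort()  # keys are pairwise distinct, so 'prev' never decides the order

    bwt = "".join(prev for _, prev in rows)

    # LCP[k] = length of the longest common prefix of suffixes k-1 and k.
    lcp = [0]
    for (a, _), (b, _) in zip(rows, rows[1:]):
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcp.append(h)
    return bwt, lcp


if __name__ == "__main__":
    bwt, lcp = bwt_and_lcp(["banana"])
    print(bwt)       # annb$aa
    print(lcp)       # [0, 0, 1, 3, 0, 0, 2]
    print(max(lcp))  # the maximum LCP value of the collection
```

Sorting every suffix explicitly takes far more time and memory than a real construction algorithm would, which is why this is useful only as a reference for small inputs.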

Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear-time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of maximal repeats, the computation of all-pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.
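The split-then-merge idea can be illustrated with a small in-RAM sketch. The code below merges two partial BWTs by iteratively refining an interleave vector, in the style of the Holt-McMillan merge; this is one known merging technique, shown here as an assumption about the general approach and not as the paper's external-memory implementation. It does not compute LCP values. The names `naive_bwt` and `merge_bwts` are hypothetical, and the sketch assumes every alphabet symbol sorts after `'$'`.

```python
from collections import Counter


def naive_bwt(seqs):
    """Reference BWT of a collection, by explicitly sorting all suffixes."""
    rows = []
    for i, s in enumerate(seqs):
        for p in range(len(s) + 1):
            # Terminator of sequence i is (0, i): smaller than any letter (1, ch).
            key = tuple((1, ch) for ch in s[p:]) + ((0, i),)
            rows.append((key, s[p - 1] if p > 0 else "$"))
    rows.sort()
    return "".join(prev for _, prev in rows)


def merge_bwts(bwt0, bwt1):
    """Merge two partial BWTs; return (merged BWT, number of passes).

    Assumption: all sequences behind bwt0 precede those behind bwt1,
    so the terminators of bwt0 sort before the terminators of bwt1.
    """
    parts = (bwt0, bwt1)
    m0, m1 = bwt0.count("$"), bwt1.count("$")
    n = len(bwt0) + len(bwt1)

    # start[c] = first row of the merged BWT whose suffix begins with c.
    counts = Counter(bwt0) + Counter(bwt1)
    start, pos = {}, 0
    for c in sorted(counts):  # '$' is assumed to sort before all letters
        start[c] = pos
        pos += counts[c]

    # z[k] tells which partial BWT supplies row k of the merged BWT.
    z = [0] * len(bwt0) + [1] * len(bwt1)
    passes = 0
    while True:
        passes += 1
        nxt = dict(start)
        # Rows of suffixes that start with a terminator are known in advance.
        new_z = [0] * m0 + [1] * m1 + [0] * (n - m0 - m1)
        read = [0, 0]
        for b in z:  # one sequential scan of both partial BWTs
            c = parts[b][read[b]]
            read[b] += 1
            if c != "$":
                new_z[nxt[c]] = b
                nxt[c] += 1
        if new_z == z:  # interleave is stable: suffixes are fully sorted
            break
        z = new_z

    read, out = [0, 0], []
    for b in z:
        out.append(parts[b][read[b]])
        read[b] += 1
    return "".join(out), passes


if __name__ == "__main__":
    a, b = ["GATTACA", "TTAGA"], ["ACATTA", "GATT"]
    merged, passes = merge_bwts(naive_bwt(a), naive_bwt(b))
    print(merged == naive_bwt(a + b), passes)
```

Each pass reads both partial BWTs strictly left to right, which is what makes this style of merge attractive when the data lives on disk. The number of passes grows with the length of the longest prefix shared by suffixes from the two subcollections.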

Conclusions: We prove that our algorithm performs $O(n\,\mathsf{maxlcp})$ sequential I/Os, where $n$ is the total length of the collection and $\mathsf{maxlcp}$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.




Source journal

Algorithms for Molecular Biology (Biology: Biochemical Research Methods)

CiteScore

2.40

Self-citation rate

10.00%

Articles published

Review time

>12 weeks

About the journal: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.