Fulgor: a fast and compact k-mer index for large-scale matching and color queries.

IF 1.5 · Zone 4 (Biology) · JCR Q4, Biochemical Research Methods

Algorithms for Molecular Biology Pub Date : 2024-01-22 DOI:10.1186/s13015-024-00251-9

Jason Fan, Jamshed Khan, Noor Pratap Singh, Giulio Ermanno Pibiri, Rob Patro

Algorithms for Molecular Biology, vol. 19(1), p. 3 · Journal Article · Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10810250/pdf/ · https://doi.org/10.1186/s13015-024-00251-9

Citations: 0

Abstract

The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index occupies very little space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto (the strongest competitor in terms of the index space vs. query time trade-off), Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.
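
The color-ordering idea in the abstract can be illustrated with a small Python sketch. It is not Fulgor's implementation: the function and class names (`build_color_order`, `RankBits`, `color_of`) are hypothetical, and a plain prefix-sum array stands in for a succinct rank structure, so it does not achieve the 1 + o(1) bits per unitig that the paper describes. It only shows why sorting unitigs by color lets a single boundary bit per unitig recover each unitig's color.

```python
from typing import Iterable, List, Sequence, Tuple

Color = Tuple[int, ...]  # sorted tuple of reference ids (an illustrative encoding)


def build_color_order(
    unitig_colors: Sequence[Iterable[int]],
) -> Tuple[List[int], List[int], List[Color]]:
    """Sort unitigs by color and emit one boundary bit per unitig.

    Returns (order, bits, colors): order[i] is the original id of the unitig
    at sorted position i, bits[i] is 1 iff position i is the last unitig of
    its color run, and colors lists the distinct colors in run order.
    """
    keyed: List[Color] = [tuple(sorted(c)) for c in unitig_colors]
    # Equal colors become contiguous after sorting.
    order = sorted(range(len(keyed)), key=keyed.__getitem__)
    bits: List[int] = []
    colors: List[Color] = []
    for pos, unitig in enumerate(order):
        if not colors or colors[-1] != keyed[unitig]:
            colors.append(keyed[unitig])
        is_last = pos + 1 == len(order) or keyed[order[pos + 1]] != keyed[unitig]
        bits.append(1 if is_last else 0)
    return order, bits, colors


class RankBits:
    """Prefix-sum rank; an assumption standing in for a succinct bitvector."""

    def __init__(self, bits: Sequence[int]) -> None:
        self.prefix = [0]
        for b in bits:
            self.prefix.append(self.prefix[-1] + b)

    def rank1(self, pos: int) -> int:
        """Number of 1 bits strictly before position pos."""
        return self.prefix[pos]


def color_of(pos: int, rank: RankBits, colors: Sequence[Color]) -> Color:
    """Color of the unitig at sorted position pos: count completed runs."""
    return colors[rank.rank1(pos)]


# Five unitigs over three references (0, 1, 2); the data is made up.
unitig_colors = [{0, 2}, {1}, {0, 2}, {1}, {0, 1, 2}]
order, bits, colors = build_color_order(unitig_colors)
rank = RankBits(bits)
recovered = [color_of(pos, rank, colors) for pos in range(len(order))]
```

In this toy setting a k-mer lookup would first return the sorted position of the unitig containing the k-mer (the role of the k-mer dictionary), and `color_of` would then map that position to a color with a single rank query.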


Source journal

Algorithms for Molecular Biology (Biology: Biochemical Research Methods)

CiteScore: 2.40

Self-citation rate: 10.00%

Review time: >12 weeks

Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.