The open-closed mod-minimizer algorithm.

IF 1.7 | Zone 4, Biology | Q4, Biochemical Research Methods

Algorithms for Molecular Biology Pub Date : 2025-03-17 DOI:10.1186/s13015-025-00270-0

Ragnar Groot Koerkamp, Daniel Liu, Giulio Ermanno Pibiri

Algorithms for Molecular Biology, volume 20 1, pages 4. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11912762/pdf/

Citations: 0

Abstract

Sampling algorithms that deterministically select a subset of $k$-mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one $k$-mer out of every window of $w$ consecutive $k$-mers. The folklore and most used scheme is the random minimizer, which selects the smallest $k$-mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected $k$-mers) of $2 / (w + 1)$. In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach the optimal density $1/w$ when $k \to \infty$, like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when $k \leq w$. In this small-$k$ regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed syncmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method, the open-closed minimizer, offers improved density for small $k \leq w$ while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, which achieve very low density in the small-$k$ regime, our method has comparable density while being computationally simpler and more intuitive. Furthermore, we extend the mod-minimizer so that any scheme that works well for small $k$ can also achieve low density when $k > w$ is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all $k$.
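The sampling idea described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation, only a toy version under stated assumptions: the "random order" is simulated with a BLAKE2b hash, the syncmer parameter `s` is a free choice, a $k$-mer is treated as an open syncmer when its smallest $s$-mer sits at the middle offset `(k - s) // 2` and as a closed syncmer when it sits at the first or last offset, and ties are broken by the hash of the $k$-mer and then by the leftmost position. All function names are hypothetical.

```python
import hashlib
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def h(s: str) -> int:
    """Deterministic pseudo-random order on strings (assumed stand-in for a random order)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def random_key(kmer: str):
    """Random minimizer: k-mers are ordered by their hash only."""
    return h(kmer)


def make_open_closed_key(k: int, s: int):
    """Order that prefers open syncmers, then closed syncmers, then all other k-mers."""
    mid = (k - s) // 2  # assumed offset that defines an open syncmer

    def key(kmer: str):
        # Offset of the smallest s-mer inside the k-mer (leftmost on ties).
        x = min(range(k - s + 1), key=lambda i: (h(kmer[i:i + s]), i))
        if x == mid:
            cls = 0  # open syncmer
        elif x == 0 or x == k - s:
            cls = 1  # closed syncmer
        else:
            cls = 2  # neither
        return (cls, h(kmer))  # assumed tie-break: hash of the k-mer

    return key


def sample(seq: str, w: int, k: int, key):
    """Return the set of sampled k-mer positions and the total number of k-mers."""
    keys = [key(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(keys) - w + 1):
        # One k-mer per window of w consecutive k-mers: the smallest, leftmost on ties.
        picked.add(min(range(start, start + w), key=lambda i: (keys[i], i)))
    return picked, len(keys)


def density(seq: str, w: int, k: int, key) -> float:
    """Fraction of k-mer positions that are sampled."""
    picked, n = sample(seq, w, k, key)
    return len(picked) / n


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    w, k, s = 9, 15, 4
    print("2/(w+1)         :", 2 / (w + 1))
    print("random minimizer:", density(seq, w, k, random_key))
    print("open-closed     :", density(seq, w, k, make_open_closed_key(k, s)))
```

On a long random sequence the measured density of the random minimizer should land near $2/(w+1)$. The density of the open-closed order depends on the choice of `s`, which this sketch does not tune.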




Source journal: Algorithms for Molecular Biology (Biology: Biochemical Research Methods)
CiteScore: 2.40
Self-citation rate: 10.00%
Articles published: (no value given)
Review time: >12 weeks

Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.