The open-closed mod-minimizer algorithm.

IF 1.5 · Zone 4 (Biology) · JCR Q4 (Biochemical Research Methods)
Ragnar Groot Koerkamp, Daniel Liu, Giulio Ermanno Pibiri
Algorithms for Molecular Biology 20(1):4. Published 2025-03-17. DOI: 10.1186/s13015-025-00270-0 (https://doi.org/10.1186/s13015-025-00270-0). Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11912762/pdf/
Citations: 0

Abstract

Sampling algorithms that deterministically select a subset of k-mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one k-mer out of every window of w consecutive k-mers. The folklore and most used scheme is the random minimizer, which selects the smallest k-mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected k-mers) of 2/(w+1). In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach the optimal density 1/w when k → ∞, like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when k ≤ w. In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed syncmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method, the open-closed minimizer, offers improved density for small k ≤ w while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, which achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and more intuitive. Furthermore, we extend the mod-minimizer to improve the density of any scheme that works well for small k so that it also works well when k > w is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.
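
The sampling rule in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation. The function names (`h`, `random_priority`, `open_closed_priority`, `sample`, `density`), the syncmer parameter `s`, the choice of the middle offset for open syncmers, and the tie-breaking by k-mer hash are all assumptions made for illustration; the paper itself defines the exact rules. The sketch picks, in every window of w consecutive k-mers, the leftmost k-mer with the smallest priority, and measures the fraction of k-mers picked on a random DNA string.

```python
import hashlib
import random


def h(s: str) -> int:
    """Deterministic 64-bit hash standing in for 'some random order'."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def random_priority(kmer: str) -> tuple:
    # Random minimizer: k-mers are ordered by their hash alone.
    return (0, h(kmer))


def open_closed_priority(kmer: str, s: int) -> tuple:
    # ASSUMED rule (one reading of the abstract): open syncmers are preferred,
    # then closed syncmers, then all other k-mers; ties are broken by hash.
    n = len(kmer) - s + 1                      # number of s-mers in the k-mer
    hashes = [h(kmer[j:j + s]) for j in range(n)]
    p = hashes.index(min(hashes))              # leftmost position of the smallest s-mer
    if p == (n - 1) // 2:
        cls = 0                                # open syncmer (assumed middle offset)
    elif p == 0 or p == n - 1:
        cls = 1                                # closed syncmer (smallest s-mer at an end)
    else:
        cls = 2                                # neither
    return (cls, h(kmer))


def sample(seq: str, k: int, w: int, priority) -> list:
    """Sorted start positions picked over all windows of w consecutive k-mers."""
    keys = [priority(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(keys) - w + 1):
        # min() returns the first minimal element, i.e. the leftmost one.
        picked.add(min(range(start, start + w), key=keys.__getitem__))
    return sorted(picked)


def density(seq: str, k: int, w: int, priority) -> float:
    """Fraction of k-mers in seq that are picked."""
    return len(sample(seq, k, w, priority)) / (len(seq) - k + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    k, w, s = 12, 12, 4                        # illustrative values with k <= w
    print("random minimizer:  ", density(seq, k, w, random_priority))
    print("open-closed sketch:", density(seq, k, w, lambda x: open_closed_priority(x, s)))
    print("2/(w+1) =", 2 / (w + 1), " 1/w =", 1 / w)
```

Any priority function plugged into `sample` picks at least one k-mer per window, so the measured density cannot fall below roughly 1/w. The random priority should land near 2/(w+1) on a long random string. How far the open-closed sketch falls below that depends on k, w, and the assumed parameter `s`.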

Source journal: Algorithms for Molecular Biology (Biology: Biochemical Research Methods)
CiteScore: 2.40
Self-citation rate: 10.00%
Articles published: 16
Review time: >12 weeks
Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.