Bryce Kille, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, Todd Treangen
bioRxiv - Bioinformatics, published 2024-09-11. DOI: 10.1101/2024.09.06.611668 (https://doi.org/10.1101/2024.09.06.611668)
A near-tight lower bound on the density of forward sampling schemes
Motivation: Sampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee that at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e., have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two.
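To make the window guarantee and the notion of density concrete, here is a minimal sketch of a random minimizer scheme in Python. The hash-based k-mer order, the function names, the random test sequence, and the parameter values are all illustrative assumptions and are not taken from the paper. For a random order, the measured density is commonly expected to be roughly 2/(w+1); that figure is a well-known result from the minimizer literature, not a claim made in the abstract.

```python
import hashlib
import random


def kmer_rank(kmer: str) -> int:
    """Pseudo-random rank of a k-mer (assumed ordering: a 64-bit BLAKE2b hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_positions(seq: str, w: int, k: int) -> set[int]:
    """Sample, from every window of w consecutive k-mers, the leftmost
    k-mer of smallest rank. Returns the start positions of sampled k-mers."""
    n_kmers = len(seq) - k + 1
    ranks = [kmer_rank(seq[i:i + k]) for i in range(n_kmers)]
    sampled = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # Ties are broken toward the leftmost position.
        sampled.add(min(window, key=lambda i: (ranks[i], i)))
    return sampled


# Illustrative input: a random DNA sequence (hypothetical test data).
random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(20000))
w, k = 19, 19

positions = minimizer_positions(seq, w, k)
n_kmers = len(seq) - k + 1
density = len(positions) / n_kmers
print(f"sampled {len(positions)} of {n_kmers} k-mers, density = {density:.3f}")
```

Because one k-mer is chosen per window, every run of w consecutive k-mers contains at least one sampled position by construction; the density is simply the fraction of k-mers that end up sampled.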
Results: We prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we find optimal schemes and observe that our bound is tight when k ≡ 1 (mod w). For large w and k, the bound can be approximated by 1/(w+k)·⌈(w+k)/w⌉. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the default minimap2 HiFi settings w=19 and k=19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al., is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k ≡ 1 (mod w) and the alphabet size σ → ∞, we show that the mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density, matching our lower bound.
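As a rough illustration, the approximation quoted above can be evaluated directly. The sketch below computes only the large-parameter expression 1/(w+k)·⌈(w+k)/w⌉ from the abstract, not the exact bound proved in the paper, and compares it with the trivial bound 1/w that follows from sampling at least one k-mer per window. The function names and the parameter pairs (all chosen with k ≡ 1 (mod w)) are illustrative assumptions.

```python
from fractions import Fraction


def approx_lower_bound(w: int, k: int) -> Fraction:
    """Approximation from the abstract: 1/(w+k) * ceil((w+k)/w)."""
    if w < 1 or k < 1:
        raise ValueError("w and k must be positive integers")
    total = w + k
    ceil_ratio = -(-total // w)  # integer ceiling of (w+k)/w
    return Fraction(ceil_ratio, total)


def trivial_lower_bound(w: int) -> Fraction:
    """Any scheme that samples one k-mer per w-window has density >= 1/w."""
    return Fraction(1, w)


# Hypothetical parameter pairs, each with k congruent to 1 modulo w.
for w, k in [(5, 6), (19, 20), (19, 39)]:
    approx = approx_lower_bound(w, k)
    trivial = trivial_lower_bound(w)
    print(f"w={w:2d} k={k:2d}  approx={float(approx):.4f}  1/w={float(trivial):.4f}")
```

Exact fractions are used so that the ceiling is computed with integer arithmetic; since ⌈x⌉ ≥ x, the expression is never smaller than 1/w.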