Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.

IF 4.4 · Zone 3, Biology · Q1, Biochemical Research Methods
Bryce Kille, Erik Garrison, Todd J Treangen, Adam M Phillippy
Bioinformatics, volume 39, issue 9. Published 2023-09-02. DOI: 10.1093/bioinformatics/btad512. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10505501/pdf/

Abstract

Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, because they rely on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly affects downstream tools that rely on the accuracy of these estimates.
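As a minimal illustration of the quantity being estimated, the sketch below computes the exact Jaccard similarity between the k-mer sets of two sequences. The function names `kmers` and `jaccard` are hypothetical and are not taken from MashMap; the example only shows the definition, not how MashMap approximates it.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Two 8-mers that differ only in the last base share 4 of 5 distinct 4-mers.
similarity = jaccard(kmers("ACGTACGT", 4), kmers("ACGTACGA", 4))
```

Computing this exactly requires every k-mer of both sequences, which is why tools in this space compare reduced representations instead.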

Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by using a rolling minhash that samples multiple k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
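The idea of sampling several k-mers per window can be sketched as follows. This is a naive illustration and not MashMap3's implementation. It assumes that the minmers of a window are the `s` k-mers with the smallest hash values among its `w` consecutive k-mers, so that `s = 1` gives ordinary minimizers. The names `kmer_hash`, `minmer_positions`, `bottom_s` and `minhash_jaccard`, the choice of BLAKE2b as hash function, and the ranking of positions rather than distinct k-mers are all assumptions made for this example. The final function is the standard bottom-s minhash estimator of Jaccard similarity.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minmer_positions(seq: str, k: int, w: int, s: int) -> list[int]:
    """Positions of k-mers that rank among the s smallest hashes in at
    least one window of w consecutive k-mers (naive, recomputed per window)."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked: set[int] = set()
    for start in range(len(hashes) - w + 1):
        # Rank positions in this window by hash value, breaking ties by position.
        ranked = sorted(range(start, start + w), key=lambda i: (hashes[i], i))
        picked.update(ranked[:s])
    return sorted(picked)


def bottom_s(hashes: set[int], s: int) -> set[int]:
    """The s smallest hash values of a set (its bottom-s sketch)."""
    return set(sorted(hashes)[:s])


def minhash_jaccard(a: set[int], b: set[int], s: int) -> float:
    """Bottom-s minhash estimate of the Jaccard similarity of a and b."""
    sketch_a, sketch_b = bottom_s(a, s), bottom_s(b, s)
    sketch_union = bottom_s(sketch_a | sketch_b, s)
    if not sketch_union:
        return 1.0  # both inputs empty; treated as identical in this sketch
    return len(sketch_union & sketch_a & sketch_b) / len(sketch_union)
```

With `s = 1` the first function reduces to minimizer winnowing. Larger `s` keeps more k-mers per window, which is what allows a per-window minhash sketch to be recovered from the sampled set. A production version would maintain the window's smallest hashes incrementally rather than re-sorting every window.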

Availability and implementation: MashMap3 is available at https://github.com/marbl/MashMap.
