Estimating similarity and distance using FracMinHash.

IF 1.5 · Zone 4 (Biology) · JCR Q4: Biochemical Research Methods
Mahmudur Rahman Hera, David Koslicki
Algorithms for Molecular Biology, vol. 20, no. 1, p. 8. Journal Article. Published 2025-05-15.
DOI: https://doi.org/10.1186/s13015-025-00276-8
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12082993/pdf/
Citations: 0

Abstract

Motivation: The growing number and volume of genomic and metagenomic datasets necessitate scalable and robust computational models for precise analysis. Sketching techniques that use k-mers from a biological sample have proven useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several applications. Recent studies on FracMinHash established unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics are still lacking.

Theoretical contributions: In this paper, we present a theoretical framework for estimating similarity/distance metrics by using FracMinHash sketches, when the metric is expressible in a certain form. We establish conditions under which such an estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings.
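To make the role of the scale factor s concrete, here is a minimal Python sketch of FracMinHash as it is commonly described: hash every k-mer and keep only the hashes that fall in the lowest s fraction of the hash range. The function names, the BLAKE2b hash, and the plain ratio estimators are illustrative assumptions, not the paper's or frac-kmc's implementation; canonical k-mers and the bias corrections studied in the literature are omitted.

```python
import hashlib

HASH_MAX = 2**64  # hashes live in [0, HASH_MAX); 64-bit range is an assumption


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k, s):
    """Return the hashes of distinct k-mers that fall below s * HASH_MAX."""
    threshold = int(HASH_MAX * s)
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Plain Jaccard ratio computed on two sketches (no bias correction)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of sketch a that also appears in sketch b (no bias correction)."""
    return len(a & b) / len(a) if a else 0.0
```

With s = 1 every distinct k-mer is kept and the ratios are exact; smaller values of s shrink the sketch, and a sketch at a smaller s is always a subset of the sketch at a larger s for the same input.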

Practical contributions: We also present frac-kmc, a fast and efficient FracMinHash sketch generator. frac-kmc is the fastest known FracMinHash sketch generator, and it delivers accurate and precise results for cosine similarity estimation on real data. frac-kmc is also the first parallel tool for this task: it can speed up sketch generation by using multiple CPU cores, an option lacking in existing serial tools. We show that by computing FracMinHash sketches with frac-kmc, we can estimate pairwise similarity quickly and accurately on real data. frac-kmc is freely available at https://github.com/KoslickiLab/frac-kmc/.
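As an illustration of cosine similarity estimation from sketches, the following Python sketch keeps the abundance of each sampled k-mer and computes the cosine of the two sampled abundance vectors. This is one plausible reading of the idea under stated assumptions, not the estimator defined in the paper and not frac-kmc's code; the function names and the BLAKE2b hash are hypothetical choices.

```python
import hashlib
import math
from collections import Counter

HASH_MAX = 2**64  # hashes live in [0, HASH_MAX); 64-bit range is an assumption


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def weighted_sketch(seq, k, s):
    """Count occurrences of each k-mer whose hash falls below s * HASH_MAX."""
    threshold = int(HASH_MAX * s)
    counts = Counter()
    for i in range(len(seq) - k + 1):
        h = hash_kmer(seq[i:i + k])
        if h < threshold:
            counts[h] += 1
    return counts


def cosine(a, b):
    """Cosine of two sampled abundance vectors; 0.0 if either is empty."""
    dot = sum(count * b[h] for h, count in a.items() if h in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

At s = 1 the result is the exact cosine similarity of the full k-mer abundance vectors; for s < 1 it is computed only over the sampled k-mers, which is where a minimum scale factor matters for accuracy.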

Source journal: Algorithms for Molecular Biology (Biology: Biochemical Research Methods)
CiteScore: 2.40
Self-citation rate: 10.00%
Articles published: 16
Review time: >12 weeks
Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.