Multi-metric locality sensitive hashing enhances alignment accuracy of bisulfite sequencing reads: BisHash.

Impact factor: 2.8 (JCR Q2, Mathematical & Computational Biology)
Journal: Bioinformatics advances. Publication date: 2025-07-23. eCollection date: 2025-01-01. DOI: 10.1093/bioadv/vbaf144
Hassan Nikaein, Ali Sharifi-Zarchi
Bioinformatics advances, vol. 5(1), vbaf144. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12360834/pdf/

Abstract

Motivation: Locality-Sensitive Hashing (LSH) is a widely used algorithm for estimating similarity between large datasets in bioinformatics, with applications in genome assembly, sequence alignment, and metagenomics. However, traditional single-metric LSH approaches often lead to inefficiencies, particularly when handling biological data where regions may have diverse evolutionary histories or structural properties. This limitation can reduce accuracy in sequence alignment, variant calling, and functional analysis.
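For readers unfamiliar with the single-metric baseline, the following is a minimal sketch of MinHash, one common LSH scheme for estimating Jaccard similarity between k-mer sets. It is not taken from the paper: the function names, the choice of k, the number of hash functions, and the use of salted BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=8, num_hashes=64):
    """For each hash function, keep the smallest hash value over all k-mers."""
    ks = kmers(seq, k)
    return [min(kmer_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature slots estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two sequences with identical k-mer sets always produce identical signatures, so the estimate is exactly 1; for other pairs the estimate approaches the true Jaccard similarity as `num_hashes` grows. Only one notion of similarity (exact k-mer identity) is captured here, which is the limitation the abstract refers to.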

Results: We propose Multi-Metric Locality-Sensitive Hashing (M2LSH), an extension of LSH that integrates multiple similarity metrics for more accurate analysis of complex biological data. By capturing diverse sequence and structural features, M2LSH improves performance in heterogeneous and evolutionarily diverse regions. Building on this, we introduce Multi-Metric MinHash (M3Hash), enhancing sequence alignment and similarity detection. As a proof of concept, we present BisHash, which applies M2LSH to bisulfite sequencing, a key method in DNA methylation analysis. Although not fully optimized, BisHash demonstrates superior accuracy, particularly in challenging scenarios like cancer studies where traditional approaches often fail. Our results highlight the potential of M2LSH and M3Hash to advance bioinformatics research.
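To make the multi-metric idea concrete, here is a hedged sketch that scores a read against a reference under several sequence "views" at once. It is an illustration only and does not reproduce the M2LSH, M3Hash, or BisHash algorithms, whose details are not given in this abstract. The view names, the three-letter C-to-T and G-to-A reductions (a conversion commonly used by bisulfite aligners), and the rule of picking the best-scoring view are assumptions.

```python
import hashlib

# Hypothetical views: each maps a sequence into the alphabet in which one
# similarity metric is evaluated. Names and choices are illustrative.
VIEWS = {
    "exact": lambda s: s,
    "c_to_t": lambda s: s.replace("C", "T"),  # bisulfite conversion, one strand
    "g_to_a": lambda s: s.replace("G", "A"),  # the complementary conversion
}


def signature(seq, k=6, num_hashes=128):
    """MinHash signature: the minimum salted hash over all k-mers, per seed."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest()
            for km in kmers
        ))
    return sig


def multi_view_similarity(read, reference, k=6, num_hashes=128):
    """Estimate the Jaccard similarity of read and reference under every view."""
    scores = {}
    for name, view in VIEWS.items():
        a = signature(view(read), k, num_hashes)
        b = signature(view(reference), k, num_hashes)
        scores[name] = sum(x == y for x, y in zip(a, b)) / num_hashes
    return scores


# Toy data: a read in which every C was converted to T (fully unmethylated).
reference = "ACGTTCAGGCTAACGTCGATCCAGTACGTTCA"
read = reference.replace("C", "T")
scores = multi_view_similarity(read, reference)
best_view = max(scores, key=scores.get)  # assumed rule: keep the best view
```

In this toy case the exact view sees almost no shared k-mers, because every 6-mer of the reference contains a C that the read has lost, while the C-to-T view maps both sequences to the same string. A single-metric sketch would therefore miss a match that a multi-view comparison recovers.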

Availability and implementation: The source code for BisHash and the test procedures for benchmarking aligners using simulated data are publicly accessible at https://github.com/hnikaein/bisHash.
