Construction of edit-distance graphs for large sets of short reads through minimizer-bucketing.

Impact factor: 2.4 · JCR Q2 · Mathematical & Computational Biology

Bioinformatics Advances · Pub Date: 2025-04-10 · eCollection Date: 2025-01-01 · DOI: 10.1093/bioadv/vbaf081

Pengyao Ping, Jinyan Li

Bioinformatics Advances, vol. 5, no. 1, vbaf081 · Journal Article · Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12040381/pdf/

Citations: 0

Abstract

Motivation: Pairs of short reads with small edit distances, along with their unique molecular identifier tags, have been exploited to correct sequencing errors in both reads and tags. However, brute-force identification of these pairs is impractical for large datasets containing ten million or more reads because of its quadratic complexity. Minimizer-bucketing and locality-sensitive hashing have been used to partition read sets into buckets of similar reads, so that edit distances need to be calculated only within each bucket. Challenges remain, however, such as minimizing the number of missed pairs, optimizing bucketing parameters, and combining bucketing methods to improve pair detection.
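To make the bucketing idea concrete, here is a minimal Python sketch. It is not the reads2graph implementation: the function names `minimizer` and `bucket_by_minimizer`, the choice of one minimizer per read, and the lexicographic k-mer ordering are all assumptions made for illustration (practical tools typically use a hashed or randomized ordering and may assign several minimizers per read).

```python
from collections import defaultdict


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of a read.

    Assumption: lexicographic order stands in for whatever k-mer
    ordering a real tool would use.
    """
    if len(read) < k:
        return read  # read shorter than k: use the whole read as its key
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads: list[str], k: int) -> dict[str, list[int]]:
    """Group read indices by their minimizer (hypothetical helper)."""
    buckets = defaultdict(list)
    for idx, read in enumerate(reads):
        buckets[minimizer(read, k)].append(idx)
    return dict(buckets)


reads = ["ACGTAC", "ACGTTC", "TTTTGG"]
print(bucket_by_minimizer(reads, k=3))
# {'ACG': [0, 1], 'TGG': [2]}
```

Reads 0 and 1 differ by one substitution and share the minimizer `ACG`, so they land in the same bucket and would be compared. Two similar reads whose minimizers differ would be separated, which is the source of the missed pairs mentioned above.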

Results: We define an edit-distance graph for a set of short reads, in which nodes represent reads and edges connect reads with small edit distances, and we present a heuristic method, reads2graph, that detects edges with high completeness. reads2graph uses three techniques: minimizer-bucketing, an improved Order-Min-Hash technique to divide large bins, and a novel graph neighbourhood multi-hop traversal within large bins to detect more edges. We then establish optimal bucketing settings to maximize ground truth edge coverage per bin. Extensive testing demonstrates that reads2graph can achieve 97%-100% completeness in most cases, outperforming brute-force identification in speed while providing a superior speed-completeness balance compared to using a single bucketing method like Miniception or Order-Min-Hash.
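The edit-distance graph defined above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the reads2graph algorithm: `edit_distance`, `edit_distance_graph`, the `max_dist` threshold, and the `buckets` argument are hypothetical names, and the Order-Min-Hash and multi-hop traversal steps are not modelled. With `buckets=None` the function falls back to the quadratic brute-force comparison that serves as ground truth.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def edit_distance_graph(reads, max_dist, buckets=None):
    """Return {(i, j): distance} for read pairs within max_dist.

    buckets: iterable of index collections; only pairs that share a
    bucket are compared. None means a single bucket holding every read
    (brute force).
    """
    if buckets is None:
        buckets = [range(len(reads))]
    edges = {}
    for bucket in buckets:
        for i, j in combinations(sorted(bucket), 2):
            if (i, j) in edges:
                continue  # edge already found through another bucket
            d = edit_distance(reads[i], reads[j])
            if d <= max_dist:
                edges[(i, j)] = d
    return edges


reads = ["ACGTAC", "ACGTTC", "TTTTGG"]
truth = edit_distance_graph(reads, max_dist=1)
print(truth)  # {(0, 1): 1}
```

Completeness, as the term is used in the abstract, can then be read as the fraction of ground-truth edges that a bucketed run recovers. For example, `edit_distance_graph(reads, 1, buckets=[[0], [1, 2]])` returns an empty dictionary because reads 0 and 1 were separated, giving a completeness of 0 out of 1 for this toy input.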

Availability and implementation: reads2graph is publicly available at https://github.com/JappyPing/reads2graph.

Source journal

Bioinformatics advances

CiteScore

1.60

Self-citation rate

0.00%

Number of publications