UMI-nea: a fast, robust tool for reference-free UMI deduplication and accurate quantification.

IF 5.4

Bioinformatics (Oxford, England) Pub Date : 2025-09-01 DOI:10.1093/bioinformatics/btaf514

Jixin Deng, Jingxiao Zhang, Song Tian, John DiCarlo, Hong Xu, Samuel J Rulli, Jonathan M Shaffer, Vikas Gupta, Toeresin Karakoyun



Abstract

Motivation: A key application of Unique Molecular Identifiers (UMIs) in high-throughput sequencing is correcting PCR amplification bias and removing PCR duplicates, thereby improving quantification in DNA-seq and RNA-seq. Accurately grouping error-bearing UMIs that originate from the same input molecule, which is the task of a UMI deduplication method, is a critical step in this process. However, many existing UMI deduplication tools rely on simple Hamming distance comparisons or suboptimal clustering algorithms, which often produce erroneous UMI groupings, particularly in error-prone long-read sequencing or ultra-high-depth short-read sequencing.

Results: We introduce UMI-nea, a tool that utilizes Levenshtein distance comparisons and a novel clustering approach to optimize multithreading workflows. Compared against three other indel-aware UMI deduplication tools, UMI-nea achieves more accurate UMI groupings with efficient run time. It demonstrates robust performance across diverse sequencing platforms, depths, and UMI lengths. Additionally, UMI-nea incorporates a data-guided adaptive UMI filter, further enhancing quantification accuracy.
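
The contrast between Hamming and Levenshtein distance drawn above can be illustrated with a minimal sketch. It is not UMI-nea's implementation: the functions and the 12-nt UMI sequences are hypothetical and chosen only to show how a single deletion inflates a position-by-position comparison while an indel-aware distance stays small.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical UMI and an observed copy that lost one T; the fixed-length
# read window then picks up one extra downstream base (A).
true_umi = "ACGTTGCAAGCT"
observed = "ACGTGCAAGCTA"

print("Hamming:", hamming(true_umi, observed))
print("Levenshtein:", levenshtein(true_umi, observed))
```

In this example the single deletion shifts every downstream base, so the Hamming distance is 7, whereas the Levenshtein distance is 2 (one deletion plus one trailing insertion). A Hamming-based threshold of one or two mismatches would therefore treat the two reads as different molecules.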

Availability and implementation: UMI-nea is available on GitHub at https://github.com/Qiaseq-research/UMI-nea.git or on Zenodo at https://doi.org/10.5281/zenodo.16745758. Sequencing data are stored at https://qiagenpublic.blob.core.windows.net/umi-nea-datasets/.


Supplementary information: Supplementary data are available at Bioinformatics online.
