UMI-nea: a fast, robust tool for reference-free UMI deduplication and accurate quantification.
Jixin Deng, Jingxiao Zhang, Song Tian, John DiCarlo, Hong Xu, Samuel J Rulli, Jonathan M Shaffer, Vikas Gupta, Toeresin Karakoyun
Bioinformatics (Oxford, England), published 2025-09-01
DOI: https://doi.org/10.1093/bioinformatics/btaf514
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453673/pdf/
Citations: 0
Abstract
Motivation: One of the key applications of Unique Molecular Identifiers (UMIs) in high-throughput sequencing is to correct for PCR amplification bias and remove PCR duplicates, thereby improving quantification in DNA-seq and RNA-seq applications. A critical step in this process is UMI deduplication: accurately grouping error-bearing UMIs that originate from the same input molecule. However, many existing UMI deduplication tools rely on simple Hamming distance comparisons or suboptimal clustering algorithms, which often produce erroneous UMI groupings, particularly in error-prone long-read sequencing or ultra-high-depth short-read sequencing.
Results: We introduce UMI-nea, a tool that uses Levenshtein distance comparisons and a novel clustering approach to optimize multithreading workflows. Compared with three other indel-aware UMI deduplication tools, UMI-nea achieves more accurate UMI groupings with efficient run time. It demonstrates robust performance across diverse sequencing platforms, depths, and UMI lengths. Additionally, UMI-nea incorporates a data-guided adaptive UMI filter, further enhancing quantification accuracy.
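To illustrate why the choice of distance metric matters for indel-bearing UMIs, the following minimal Python sketch compares Hamming and Levenshtein distance on a single pair of sequences. It is not taken from UMI-nea and does not reflect its implementation or clustering approach; the function names and the two 12-nt UMI sequences are hypothetical and chosen only for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical UMI pair: the observed copy lost its first base, so every
# remaining base shifts left and one extra base fills the fixed-length window.
true_umi = "ACGTTGCAAGCT"
read_umi = "CGTTGCAAGCTA"

print("Hamming:    ", hamming(true_umi, read_umi))
print("Levenshtein:", levenshtein(true_umi, read_umi))
```

For this pair the Hamming distance is 10 while the Levenshtein distance is 2 (one deletion plus one insertion), so a Hamming-based threshold would treat the two UMIs as unrelated even though they differ by a single sequencing indel.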
Availability and implementation: UMI-nea is available on GitHub (https://github.com/Qiaseq-research/UMI-nea.git) and Zenodo (https://doi.org/10.5281/zenodo.16745758). Sequencing data are stored at https://qiagenpublic.blob.core.windows.net/umi-nea-datasets/.