Alignment-free unique molecular identifier clustering suppresses sequencing errors for accurate detection of low-frequency DNA variants.

Impact factor 7.7 | Zone 2 (Biology) | JCR Q1: Biochemical Research Methods

Briefings in Bioinformatics | Pub Date: 2025-08-31 | DOI: 10.1093/bib/bbaf483

Fei Yu, Haojie Xiao, Dongyang Song, Xiao Yang, Shiyue Huang, Yu Wang, Mingze Bai, Xiaoming Yao, Kunxian Shu, Dan Pu

Volume 26, Issue 5 | Journal Article | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12452285/pdf/

Citations: 0

Abstract

Accurate detection of low-frequency DNA variants (below 1%) is essential in diverse biological and clinical contexts, yet remains fundamentally constrained by the high intrinsic error rates of next-generation sequencing technologies. Although unique molecular identifiers (UMIs) have significantly mitigated these errors by uniquely indexing original template molecules, their efficacy is compromised by UMI collisions and by artifacts introduced during polymerase chain reaction (PCR) amplification and sequencing, which collectively engender false-positive variant calls. Here, we present AFUMIC, an alignment-free UMI clustering framework that systematically addresses these limitations through collision-resilient UMI grouping and a consensus quality score (CQS)-guided strategy for high-fidelity consensus sequence generation. AFUMIC reduces singleton families, enhances clustering precision, and maximizes data retention, yielding 7.27-fold and 3.84-fold increases in single-strand consensus sequence and duplex consensus sequence output, respectively, compared to Du Novo. It further decreases the per-base error rate from $3.01 \times 10^{-4}$ to $2.10 \times 10^{-5}$ and raises the proportion of error-free positions from 45.27% to 99.85%, enabling confident detection of variants at variant allele frequencies as low as $1.0 \times 10^{-5}$. Notably, AFUMIC exhibits superior computational efficiency, rendering it well-suited for high-throughput analysis of UMI-tagged libraries in large-scale genomic studies. Collectively, AFUMIC represents an efficient methodology for ultrasensitive variant detection and establishes a broadly applicable and computationally efficient framework for error-corrected sequencing that can be readily deployed in both clinical diagnostics and large-scale genomic research.
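The general idea behind UMI-based error suppression can be illustrated with a minimal sketch. This is not AFUMIC's algorithm: the abstract does not specify how its collision-resilient grouping or its CQS-guided consensus works, so the code below assumes the simplest possible stand-ins (exact-match UMI grouping and a per-position majority vote). The function names `group_by_umi` and `majority_consensus`, the `min_fraction` threshold, and the toy reads are all hypothetical.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group (umi, sequence) pairs into families keyed by exact UMI match.

    Assumption: exact matching only. Real tools also handle UMI errors
    and collisions, which this sketch ignores.
    """
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)
    return dict(families)


def majority_consensus(seqs, min_fraction=0.7):
    """Per-position majority vote over one read family.

    Emits 'N' where the most common base does not reach min_fraction.
    Reads are truncated to the shortest length in the family.
    """
    length = min(len(s) for s in seqs)
    consensus = []
    for i in range(length):
        base, count = Counter(s[i] for s in seqs).most_common(1)[0]
        consensus.append(base if count / len(seqs) >= min_fraction else "N")
    return "".join(consensus)


# Toy input: three reads share one UMI, one of them carries an error.
reads = [
    ("ACGT", "TTGACC"),
    ("ACGT", "TTGACC"),
    ("ACGT", "TTGTCC"),  # mismatch at index 3, outvoted by the other two
    ("GGCA", "TTGACC"),  # singleton family: no error correction possible
]

families = group_by_umi(reads)
consensus = {
    umi: majority_consensus(seqs, min_fraction=0.6)
    for umi, seqs in families.items()
}
```

In this toy case the isolated mismatch in the `ACGT` family is outvoted, while the singleton family passes through uncorrected. That second behaviour is one reason reducing singleton families, as the abstract reports, matters for data retention.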


Source journal

Briefings in Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20

Self-citation rate: 13.70%

Articles published: 549

Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.