ReQuant: improved base modification calling by k-mer value imputation

IF 13.1 | Zone 2 (Biology) | Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Nucleic Acids Research Pub Date : 2025-05-10 DOI:10.1093/nar/gkaf323

Roy Straver, Carlo Vermeulen, Joe R Verity-Legg, Marc Pagès-Gallego, Dieter G G Stoker, Alexander van Oudenaarden, Jeroen de Ridder



Abstract

Nanopore sequencing allows identification of base modifications, such as methylation, directly from raw current data. Prevailing approaches, including deep learning (DL) methods, require training data covering all possible sequence contexts. These data can be prohibitively expensive or impossible to obtain for some modifications. Hence, research into DNA modifications focuses on the most prevalent modification in human DNA: 5mC in a CpG context. Improved generalization is required to reach the technology’s full potential: calling any modification from raw current values. We developed ReQuant, an algorithm to impute full, k-mer based, modification models from limited k-mer context training data. ReQuant is highly accurate for calling modifications (CpG/GpC methylation and CpG glucosylation) in Lambda Phage R9 data when fitting on ≤25% of all possible 6-mers with a modification and extends to human R10 data. The success of our approach shows that DNA modifications have a consistent and therefore predictable effect on Nanopore current levels, suggesting that interpretable rule-based imputation in unseen contexts is possible. Our approach circumvents the need for modification-specific DL tools and enables modification calling when not all sequence contexts can be obtained, opening a vast field of biological base modification research.
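The general idea of imputing a full k-mer model from a subset of contexts can be illustrated with a toy example. The sketch below is not ReQuant's algorithm, which the abstract does not specify. It assumes a purely synthetic, additive "current shift" per 6-mer, fits a least-squares model on 25% of the CpG-containing 6-mers, and imputes the shift for the remaining ones. The feature encoding, the function names (`cpg_kmers`, `encode`, `run_demo`), the noise level, and the data are all hypothetical.

```python
import itertools
import numpy as np

K = 6
BASES = "ACGT"

def cpg_kmers(k=K):
    """All k-mers that contain at least one CG dinucleotide."""
    kmers = ("".join(p) for p in itertools.product(BASES, repeat=k))
    return [km for km in kmers if "CG" in km]

def encode(kmer):
    """One-hot base per position, plus a flag for each CG start position."""
    n = len(kmer)
    x = np.zeros(4 * n + (n - 1))
    for pos, base in enumerate(kmer):
        x[4 * pos + BASES.index(base)] = 1.0
    for pos in range(n - 1):
        if kmer[pos:pos + 2] == "CG":
            x[4 * n + pos] = 1.0
    return x

def run_demo(train_fraction=0.25, noise_sd=0.05, seed=0):
    rng = np.random.default_rng(seed)
    kmers = cpg_kmers()
    X = np.array([encode(km) for km in kmers])

    # ASSUMPTION: synthetic ground truth, a hidden additive rule plus noise.
    # Real nanopore current shifts are not generated this way.
    true_w = rng.normal(size=X.shape[1])
    shift = X @ true_w + rng.normal(scale=noise_sd, size=len(kmers))

    # Fit on a random subset of k-mers, hold out the rest as "unseen" contexts.
    idx = rng.permutation(len(kmers))
    n_train = int(train_fraction * len(kmers))
    train, held_out = idx[:n_train], idx[n_train:]
    w_hat, *_ = np.linalg.lstsq(X[train], shift[train], rcond=None)

    # Impute the shift for unseen k-mers and compare with a mean-only baseline.
    imputed = X[held_out] @ w_hat
    rmse = float(np.sqrt(np.mean((imputed - shift[held_out]) ** 2)))
    baseline = float(np.sqrt(np.mean((shift[train].mean() - shift[held_out]) ** 2)))
    return len(kmers), n_train, rmse, baseline

if __name__ == "__main__":
    n_kmers, n_train, rmse, baseline = run_demo()
    print(f"CpG-containing 6-mers: {n_kmers}, used for fitting: {n_train}")
    print(f"held-out RMSE: {rmse:.3f} (mean-only baseline: {baseline:.3f})")
```

In this toy setting the imputation works only because the synthetic shifts follow a consistent rule. That mirrors the abstract's argument that imputation in unseen contexts is possible when modifications have a consistent effect on current levels, but it says nothing about how well any particular model fits real data.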




Source journal

Nucleic Acids Research (Biology: Biochemistry & Molecular Biology)

CiteScore: 27.10

Self-citation rate: 4.70%

Publication volume: 1057

Review time: 2 months

About the journal: Nucleic Acids Research (NAR) is a scientific journal that publishes research on various aspects of nucleic acids and proteins involved in nucleic acid metabolism and interactions. It covers areas such as chemistry and synthetic biology, computational biology, gene regulation, chromatin and epigenetics, genome integrity, repair and replication, genomics, molecular biology, nucleic acid enzymes, RNA, and structural biology. The journal also includes a Survey and Summary section for brief reviews. Additionally, each year, the first issue is dedicated to biological databases, and an issue in July focuses on web-based software resources for the biological community. Nucleic Acids Research is indexed by several services including Abstracts on Hygiene and Communicable Diseases, Animal Breeding Abstracts, Agricultural Engineering Abstracts, Agbiotech News and Information, BIOSIS Previews, CAB Abstracts, and EMBASE.