Effective training of nanopore callers for epigenetic marks with limited labelled data.

IF 4.5 · CAS Zone 3 (Biology) · JCR Q1 · Biochemistry & Molecular Biology

Open Biology Pub Date : 2024-06-01 Epub Date: 2024-06-12 DOI:10.1098/rsob.230449

Brian Yao, Chloe Hsu, Gal Goldner, Yael Michaeli, Yuval Ebenstein, Jennifer Listgarten

Open Biology, volume 14, issue 6, article 230449. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11286150/pdf/

Abstract

Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5-methylcytosine (5mC) and N6-methyladenine (6mA). These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with hidden Markov models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. This type of approach is expected to be useful for calling other base modifications such as 5-hydroxymethylcytosine and for the simultaneous calling of different modifications, settings in which complete training data are not likely to be available.
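
The contrast the abstract draws between independent per-context emission distributions and parameters shared across contexts can be illustrated with a small toy sketch. This is not the paper's amortized-HMM, nor the Nanopolish or DeepSignal implementation. The alphabet `ACGM` (with `M` standing in for a modified base), the context length `K = 3`, the additive ground truth in `CONTRIB`, and the linear one-hot model are all assumptions made purely for illustration. Real nanopore signal is not additive in this way.

```python
import itertools

BASES = "ACGM"   # "M" is a hypothetical stand-in symbol for a modified base such as 5mC
K = 3            # toy context length; chosen only to keep the example small

# Toy ground truth (an assumption): each position contributes additively
# to the mean current level of a k-mer.
CONTRIB = [
    {"A": -1.0, "C": 0.5, "G": 1.5, "M": 2.0},
    {"A": 0.3, "C": -0.7, "G": 0.9, "M": -1.2},
    {"A": 0.0, "C": 1.1, "G": -0.4, "M": 0.6},
]
BASELINE = 90.0


def true_mean(kmer):
    """Toy emission mean for a k-mer under the additive assumption."""
    return BASELINE + sum(CONTRIB[p][b] for p, b in enumerate(kmer))


def active_features(kmer):
    """Indices of the one-hot (position, base) features that are on for this k-mer."""
    return [p * len(BASES) + BASES.index(b) for p, b in enumerate(kmer)]


all_kmers = ["".join(t) for t in itertools.product(BASES, repeat=K)]

# Hold out a quarter of the contexts so the training data are incomplete,
# while every (position, base) pair still occurs in some training k-mer.
held_out = [k for k in all_kmers if sum(BASES.index(b) for b in k) % 4 == 0]
train = [k for k in all_kmers if k not in held_out]

# 1) Independent emissions: one table entry per k-mer seen in training.
#    Nothing learned for one context informs any other context.
table = {k: true_mean(k) for k in train}

# 2) Shared parameters: a linear model over one-hot features,
#    fit to the same training entries by full-batch gradient descent.
weights = [0.0] * (K * len(BASES))
bias = 0.0


def predict(kmer):
    return bias + sum(weights[i] for i in active_features(kmer))


lr = 0.5
for _ in range(500):
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for k in train:
        err = predict(k) - table[k]
        grad_b += err
        for i in active_features(k):
            grad_w[i] += err
    bias -= lr * grad_b / len(train)
    weights = [w - lr * g / len(train) for w, g in zip(weights, grad_w)]

# The table has no entry for any held-out context; the shared model still predicts.
missing_in_table = sum(1 for k in held_out if k not in table)
max_err = max(abs(predict(k) - true_mean(k)) for k in held_out)
print(f"train={len(train)} held_out={len(held_out)} missing_in_table={missing_in_table}")
print(f"max abs error of shared model on held-out k-mers: {max_err:.2e}")
```

In this toy setting the lookup table cannot say anything about the held-out contexts, while the shared-parameter model recovers their means because the ground truth was constructed to be additive. With real data the shared model would only approximate unseen contexts, which is the generalization question the paper studies.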

Source journal

Open Biology (Biochemistry & Molecular Biology)

CiteScore: 10.00
Self-citation rate: 1.70%
Articles published: 136
Review time: 6-12 weeks

About the journal: Open Biology is an online journal that welcomes original, high-impact research in cell and developmental biology, molecular and structural biology, biochemistry, neuroscience, immunology, microbiology and genetics.