Effective training of nanopore callers for epigenetic marks with limited labelled data.

IF 4.5 | Zone 3, Biology | Q1 | BIOCHEMISTRY & MOLECULAR BIOLOGY
Open Biology Pub Date : 2024-06-01 Epub Date: 2024-06-12 DOI:10.1098/rsob.230449
Brian Yao, Chloe Hsu, Gal Goldner, Yael Michaeli, Yuval Ebenstein, Jennifer Listgarten
Open Biology, vol. 14, no. 6, article 230449. Journal Article. DOI link: https://doi.org/10.1098/rsob.230449 Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11286150/pdf/

Abstract

Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5-methylcytosine (5mC) and N6-methyladenine (6mA). These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with hidden Markov models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. This type of approach is expected to be useful for calling other base modifications such as 5-hydroxymethylcytosine and for the simultaneous calling of different modifications, settings in which complete training data are not likely to be available.
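
The contrast between independent emission distributions and shared parameters can be illustrated with a toy sketch. This is not the authors' amortized-HMM, nor Nanopolish or DeepSignal; it only shows why a per-k-mer table has nothing to say about an unseen context while a model that shares parameters across contexts can still predict one. Everything in it is an assumption made for illustration: the alphabet with "M" standing in for 5mC, the additive signal model and its constants, and the choice of held-out k-mers.

```python
from itertools import product

ALPHABET = "ACGTM"  # "M" stands in for 5mC (an illustrative encoding)
K = 3

def true_level(kmer):
    """Toy additive signal level; every constant here is made up."""
    return sum(
        10.0 * pos + 3.0 * ALPHABET.index(base) + (5.0 if base == "M" else 0.0)
        for pos, base in enumerate(kmer)
    )

ALL_KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]
# Incomplete training data: these modified contexts are never observed.
HELD_OUT = {"T" + "M" + base for base in ALPHABET}
TRAIN = [(kmer, true_level(kmer)) for kmer in ALL_KMERS if kmer not in HELD_OUT]

# Independent emissions: one parameter per k-mer, nothing shared between k-mers.
table = dict(TRAIN)

def predict_shared(weights, kmer):
    """Shared parameters: one weight per (position, base), reused by all k-mers."""
    return sum(weights[(pos, base)] for pos, base in enumerate(kmer))

def fit_shared(train, steps=1000, lr=1.0):
    """Fit the shared weights by batch gradient descent on squared error."""
    weights = {(pos, base): 0.0 for pos in range(K) for base in ALPHABET}
    for _ in range(steps):
        grad = dict.fromkeys(weights, 0.0)
        for kmer, level in train:
            err = predict_shared(weights, kmer) - level
            for pos, base in enumerate(kmer):
                grad[(pos, base)] += err / len(train)
        for key in weights:
            weights[key] -= lr * grad[key]
    return weights

weights = fit_shared(TRAIN)

for kmer in sorted(HELD_OUT):
    # table.get returns None for unseen k-mers; the shared model still predicts.
    print(kmer, table.get(kmer), round(predict_shared(weights, kmer), 3), true_level(kmer))
```

The shared model recovers the held-out levels here only because the toy signal is exactly additive; real nanopore signals are not, which is one reason a more expressive network would be used in practice.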

Source journal
Open Biology (BIOCHEMISTRY & MOLECULAR BIOLOGY)
CiteScore: 10.00
Self-citation rate: 1.70%
Articles published: 136
Review time: 6-12 weeks
About the journal: Open Biology is an online journal that welcomes original, high-impact research in cell and developmental biology, molecular and structural biology, biochemistry, neuroscience, immunology, microbiology and genetics.