Brian Yao, Chloe Hsu, Gal Goldner, Yael Michaeli, Yuval Ebenstein, Jennifer Listgarten
Open Biology 14(6): 230449. Published 2024-06-01 (Epub 2024-06-12). DOI: 10.1098/rsob.230449 (https://doi.org/10.1098/rsob.230449)
Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11286150/pdf/
Citations: 0
Abstract
Effective training of nanopore callers for epigenetic marks with limited labelled data.
Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5-methylcytosine (5mC) and N6-methyladenine (6mA). These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with hidden Markov models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. This type of approach is expected to be useful for calling other base modifications such as 5-hydroxymethylcytosine and for the simultaneous calling of different modifications, settings in which complete training data are not likely to be available.
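The contrast the abstract draws between independent per-context parameters and parameters shared across contexts can be illustrated with a toy sketch. This is not Nanopolish, DeepSignal, or the authors' amortized-HMM. It assumes, purely for illustration, that the mean signal level of a 3-mer is an additive function of its bases. All numbers, and the names `TRUE_EFFECT`, `fit_table`, and `fit_shared`, are hypothetical.

```python
from itertools import product

BASES = "ACGT"
K = 3

# Hypothetical per-position contributions to a k-mer's mean signal level.
# These values are made up for illustration; they are not measurements.
TRUE_EFFECT = [
    {"A": 20.0, "C": 25.0, "G": 22.0, "T": 28.0},
    {"A": 30.0, "C": 36.0, "G": 33.0, "T": 39.0},
    {"A": 10.0, "C": 12.0, "G": 11.0, "T": 14.0},
]


def true_level(kmer):
    """Toy ground truth: sum of the per-position effects (an assumption)."""
    return sum(TRUE_EFFECT[i][b] for i, b in enumerate(kmer))


def fit_table(train):
    """Independent parameters: one stored level per k-mer seen in training."""
    return {kmer: level for kmer, level in train}


def fit_shared(train, steps=500, lr=1.0):
    """Shared parameters: one weight per (position, base), fitted by
    batch gradient descent on the mean squared error."""
    w = [{b: 0.0 for b in BASES} for _ in range(K)]
    n = len(train)
    for _ in range(steps):
        grad = [{b: 0.0 for b in BASES} for _ in range(K)]
        for kmer, level in train:
            err = sum(w[i][b] for i, b in enumerate(kmer)) - level
            for i, b in enumerate(kmer):
                grad[i][b] += err / n
        for i in range(K):
            for b in BASES:
                w[i][b] -= lr * grad[i][b]
    return w


def predict_shared(w, kmer):
    """Predict a level for any k-mer, seen or unseen, from shared weights."""
    return sum(w[i][b] for i, b in enumerate(kmer))


# Incomplete training data: two 3-mers are never observed.
all_kmers = ["".join(p) for p in product(BASES, repeat=K)]
held_out = {"ACG", "TCA"}
train = [(k, true_level(k)) for k in all_kmers if k not in held_out]

table = fit_table(train)
shared = fit_shared(train)

for kmer in sorted(held_out):
    # The table has no entry for an unseen k-mer, so .get() returns None;
    # the shared-weight model still produces an estimate.
    print(kmer, table.get(kmer), round(predict_shared(shared, kmer), 3),
          true_level(kmer))
```

In this sketch the lookup table has nothing to offer for the held-out 3-mers, while the shared-weight model recovers their levels because every (position, base) pair still appears in training. Real nanopore signal is not additive in this way, so the example only shows why parameter sharing permits generalization. It says nothing about how well any actual caller performs.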
About the journal:
Open Biology is an online journal that welcomes original, high impact research in cell and developmental biology, molecular and structural biology, biochemistry, neuroscience, immunology, microbiology and genetics.