MuLan-Methyl: multiple transformer-based language models for accurate DNA methylation prediction

IF 11.8, Zone 2 (Biology), JCR Q1 (MULTIDISCIPLINARY SCIENCES)

GigaScience Pub Date : 2022-12-28 DOI:10.1101/2023.01.04.522704

Wenhuan Zeng, A. Gautam, D. Huson


Citations: 5

Abstract



Transformer-based language models have been used successfully on a wide range of text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, each seeking to balance computational effort against accuracy. Here, we introduce MuLan-Methyl, a deep-learning framework for predicting DNA methylation sites that is based on five popular transformer-based language models. The framework identifies methylation sites for three types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each language model is adapted to the task using the “pre-train and fine-tune” paradigm. Pre-training is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status for each type. The five models are used collectively to predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.

Key points

MuLan-Methyl aims at identifying three types of DNA-methylation sites.

It uses an ensemble of five transformer-based language models, which were pre-trained and fine-tuned on a custom corpus.

The self-attention mechanism of transformers gives rise to importance scores, which can be used to extract motifs.

The method performs favorably in comparison to existing methods.

The implementation can be applied to chromosomal sequences to predict methylation sites.


Source journal

GigaScience (MULTIDISCIPLINARY SCIENCES)

CiteScore

15.50

Self-citation rate

1.10%

Articles published

119

Review time

1 week

About the journal: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.