Multi-scale DNA language model improves 6mA binding sites prediction
Journal: Computational Biology and Chemistry (IF 2.6, JCR Q2, Biology)
Publication date: 2024-07-18
DOI: 10.1016/j.compbiolchem.2024.108129
URL: https://www.sciencedirect.com/science/article/pii/S1476927124001178
Citations: 0
Abstract
DNA methylation at the N6 position of adenine (N6-methyladenine, 6mA), the attachment of a methyl group to the N6 site of adenine (A) in DNA, is an important epigenetic modification in prokaryotic and eukaryotic genomes. Accurately predicting 6mA binding sites can provide crucial insights into gene regulation, DNA repair, and disease development. Wet-lab experiments are commonly used to analyze 6mA binding sites, but they are costly and time-consuming. Therefore, various deep learning methods have recently been used to predict 6mA binding sites. In this study, we develop a framework based on a multi-scale DNA language model, named "iDNA6mA-MDL". "iDNA6mA-MDL" integrates multiple k-mers and the nucleotide property and frequency method for feature embedding, which can capture a full range of DNA sequence context information. At the prediction stage, it also leverages DNABERT to compensate for the incomplete capture of global DNA information. Experiments show that our framework obtains an average AUC of 0.981 on a classic 6mA rice gene dataset, outperforming all existing advanced models under five-fold cross-validation. Moreover, "iDNA6mA-MDL" outperforms most popular state-of-the-art methods on another 11 6mA datasets, demonstrating its effectiveness in 6mA binding site prediction.
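The two feature families named in the abstract, multi-scale k-mers and the nucleotide property and frequency method, can be illustrated with a minimal Python sketch. The abstract does not specify the k values or the exact property encoding, so everything below is an assumption: the k-mer sizes 3 to 6 are chosen only for illustration, and the `NCP` table is the commonly used chemical-property mapping (ring structure, functional group, hydrogen bonding), not necessarily the one used in iDNA6mA-MDL. The function names are hypothetical.

```python
from collections import Counter

# Chemical-property triple per nucleotide: (ring structure, functional group,
# hydrogen bonding). This widely used mapping is an assumption here; the
# abstract does not state the encoding used by the paper.
NCP = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (0, 0, 1)}


def kmer_tokens(seq, k):
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multi_scale_tokens(seq, ks=(3, 4, 5, 6)):
    """Tokenize the same sequence at several k-mer sizes (ks is illustrative)."""
    return {k: kmer_tokens(seq, k) for k in ks}


def ncp_frequency_encoding(seq):
    """Encode each position as three property bits plus a running frequency.

    The frequency of position i is the number of times its nucleotide has
    occurred in seq[:i] (1-based, inclusive) divided by i.
    """
    counts = Counter()
    rows = []
    for i, base in enumerate(seq, start=1):
        counts[base] += 1
        rows.append((*NCP[base], counts[base] / i))
    return rows
```

For example, `multi_scale_tokens("ACGTAC", ks=(3, 4))` yields four 3-mers (`ACG`, `CGT`, `GTA`, `TAC`) and three 4-mers (`ACGT`, `CGTA`, `GTAC`), while `ncp_frequency_encoding("AAC")` gives one four-value row per nucleotide. How the paper combines these features with DNABERT is not described in the abstract and is not modeled here.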
About the journal:
Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High-quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high-quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered.
Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, complementary techniques such as molecular dynamics simulations with detailed free energy calculations should be used to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered.
Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However, prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.