Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA

BioMedInformatics | Pub Date: 2024-06-12 | DOI: 10.3390/biomedinformatics4020085

Weizhi An, Yuzhi Guo, Yatao Bian, Hehuan Ma, Jinyu Yang, Chunyuan Li, Junzhou Huang



Abstract

Acquiring meaningful representations of gene expression is essential for the accurate prediction of downstream regulatory tasks, such as identifying promoters and transcription factor binding sites. However, the current dependency on supervised learning, constrained by the limited availability of labeled genomic data, impedes the ability to develop robust predictive models with broad generalization capabilities. In response, recent advancements have pivoted towards the application of self-supervised training for DNA sequence modeling, enabling the adaptation of pre-trained genomic representations to a variety of downstream tasks. Departing from the straightforward application of masked language learning techniques to DNA sequences, approaches such as MoDNA enrich genome language modeling with prior biological knowledge. In this study, we advance DNA language models by utilizing the Motif-oriented DNA (MoDNA) pre-training framework, which is established for self-supervised learning at the pre-training stage and is flexible enough for application across different downstream tasks. MoDNA distinguishes itself by efficiently learning semantic-level genomic representations from an extensive corpus of unlabeled genome data, offering a significant improvement in computational efficiency over previous approaches. The framework is pre-trained on a comprehensive human genome dataset and fine-tuned for targeted downstream tasks. Our enhanced analysis and evaluation in promoter prediction and transcription factor binding site prediction have further validated MoDNA’s exceptional capabilities, emphasizing its contribution to advancements in genomic predictive modeling.
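For readers unfamiliar with the "straightforward" masked language learning baseline that the abstract contrasts with motif-oriented pre-training, the following is a minimal sketch of that generic idea applied to DNA. It is not MoDNA's method: the abstract does not specify a tokenizer, mask rate, or motif objective, so the overlapping k-mer tokenization, the value `k=6`, the 15% mask rate, and the helper names `kmer_tokenize` and `mask_tokens` are all illustrative assumptions.

```python
import random


def kmer_tokenize(seq, k=6):
    """Split a DNA string into overlapping k-mers with stride 1.

    Assumption: k-mer tokens are a common choice for DNA language models;
    the abstract does not say which tokenization MoDNA uses.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask symbol.

    Returns the masked token list and a dict {position: original token}
    that a model would be trained to predict. The mask rate and the
    independent per-token sampling are hypothetical simplifications.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one token so short inputs still yield a target.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, labels


if __name__ == "__main__":
    tokens = kmer_tokenize("ACGTACGTACGGATCCA", k=6)
    masked, labels = mask_tokens(tokens)
    print(masked)
    print(labels)
```

Because overlapping k-mers share bases with their neighbours, masking isolated tokens as above lets a model recover the hidden bases from adjacent tokens; practical schemes therefore tend to mask contiguous runs. A motif-oriented objective, as described in the abstract, would additionally bring prior biological knowledge into pre-training, which this sketch does not attempt.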


Source journal

BioMedInformatics

CiteScore

1.70

Self-citation rate

0.00%

Number of publications