Evaluating the representational power of pre-trained DNA language models for regulatory genomics

Impact factor 10.1 · CAS Tier 1 (Biology) · JCR Q1, Biotechnology & Applied Microbiology
Ziqi Tang, Nirali Somia, Yiyang Yu, Peter K. Koo
{"title":"Evaluating the representational power of pre-trained DNA language models for regulatory genomics","authors":"Ziqi Tang, Nirali Somia, Yiyang Yu, Peter K. Koo","doi":"10.1186/s13059-025-03674-8","DOIUrl":null,"url":null,"abstract":"The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here, we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation for six major functional genomics prediction tasks. Our findings suggest that probing the representations of current pre-trained gLMs do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. Nevertheless, highly tuned supervised models trained from scratch using one-hot encoded sequences can achieve performance competitive with or better than pre-trained models across the datasets explored in this study. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.","PeriodicalId":12611,"journal":{"name":"Genome Biology","volume":"7 1","pages":""},"PeriodicalIF":10.1000,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genome Biology","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13059-025-03674-8","RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}
引用次数: 0

Abstract

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here, we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation for six major functional genomics prediction tasks. Our findings suggest that probing the representations of current pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. Moreover, highly tuned supervised models trained from scratch using one-hot encoded sequences can achieve performance competitive with or better than pre-trained models across the datasets explored in this study. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
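Below is a minimal, illustrative sketch (not the authors' code) of the two approaches the abstract contrasts: fitting a simple probe on frozen gLM embeddings versus fitting the same probe on one-hot encoded sequences. The `glm_embed` function is a hypothetical placeholder for whatever embedding API a real pre-trained gLM exposes, and the sequences and labels are random toy data.

```python
# Sketch: linear probe on frozen gLM embeddings vs. one-hot encoded baseline.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA sequence as an (L, 4) one-hot matrix."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def glm_embed(seq: str, dim: int = 768) -> np.ndarray:
    """Hypothetical stand-in for a frozen pre-trained gLM: in practice this
    would return a (mean-pooled) embedding from the model's final layer."""
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    return rng.standard_normal(dim)  # placeholder, NOT a real model output

# Toy data: random 200-bp sequences with random "activity" labels.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(BASES), size=200)) for _ in range(100)]
y = rng.standard_normal(100)

# Probing: train only a lightweight head on frozen gLM embeddings.
X_glm = np.stack([glm_embed(s) for s in seqs])
probe = Ridge(alpha=1.0).fit(X_glm, y)

# Baseline: the same lightweight head on flattened one-hot encodings.
X_onehot = np.stack([one_hot(s).ravel() for s in seqs])
baseline = Ridge(alpha=1.0).fit(X_onehot, y)

print("gLM probe R^2:   ", r2_score(y, probe.predict(X_glm)))
print("one-hot probe R^2:", r2_score(y, baseline.predict(X_onehot)))
```

Note that the one-hot baselines described in the abstract are highly tuned supervised models trained from scratch, not the simple linear head used here for brevity; the ridge probe merely illustrates what "probing the representations" means.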
Source journal
Genome Biology (Biochemistry, Genetics and Molecular Biology - Genetics)
CiteScore: 21.00 · Self-citation rate: 3.30% · Articles per year: 241 · Review time: 2 months
About the journal: Genome Biology stands as a premier platform for exceptional research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens. With an impressive impact factor of 12.3 (2022), the journal secures its position as the 3rd-ranked research journal in the Genetics and Heredity category and the 2nd-ranked research journal in the Biotechnology and Applied Microbiology category by Thomson Reuters. Notably, Genome Biology holds the distinction of being the highest-ranked open-access journal in this category. Our dedicated team of highly trained in-house Editors collaborates closely with our esteemed Editorial Board of international experts, ensuring the journal remains on the forefront of scientific advances and community standards. Regular engagement with researchers at conferences and institute visits underscores our commitment to staying abreast of the latest developments in the field.