Genomic language models (gLMs) decode bacterial genomes for improved gene prediction and translation initiation site identification.

Impact factor 6.8; CAS Zone 2 (Biology); JCR Q1 (Biochemical Research Methods)

Briefings in bioinformatics Pub Date : 2025-07-02 DOI:10.1093/bib/bbaf311

Genereux Akotenou, Achraf El Allali

Briefings in bioinformatics, volume 26, issue 4. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12222049/pdf/

Citations: 0

Abstract

Accurate bacterial gene prediction is essential for understanding microbial functions and advancing biotechnology. Traditional methods based on sequence homology and statistical models often struggle with complex genetic variations and novel sequences because of their limited ability to interpret the "language of genes." To overcome these challenges, we explore genomic language models (gLMs), inspired by large language models (LLMs) in natural language processing, to enhance bacterial gene prediction. These models learn patterns and contextual dependencies within genetic sequences, similar to how LLMs process human language. We employ transformers, specifically DNABERT, for bacterial gene prediction using a two-stage framework: first, identifying coding sequence (CDS) regions, and then refining predictions by identifying the correct translation initiation sites (TIS). DNABERT is fine-tuned on a curated set of NCBI complete bacterial genomes using a k-mer tokenizer for sequence processing. Our results show that the resulting method, GeneLM, significantly improves gene prediction accuracy. Compared with the leading prokaryotic gene finders (Prodigal, GeneMark-HMM, and Glimmer) and other recent deep learning methods, GeneLM reduces missed CDS predictions while increasing matched annotations. More notably, our TIS predictions surpass traditional methods when tested against experimentally verified sites. GeneLM demonstrates the power of gLMs in decoding genetic information, achieving state-of-the-art performance in bacterial genome analysis. This advancement highlights the potential of language models to revolutionize genome annotation, outperforming conventional tools and enabling more precise genetic insights.
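
The abstract mentions a k-mer tokenizer and a second stage that selects a translation initiation site. The following is a minimal sketch of those two ideas only, not the authors' implementation: the function names `kmer_tokenize` and `candidate_tis`, the default `k = 6`, the stride of 1, and the restriction to the start codons ATG, GTG, and TTG on the forward strand are all assumptions made for illustration.

```python
from typing import List

# Assumption: the three most common bacterial start codons; the paper's
# actual candidate set is not stated in the abstract.
START_CODONS = {"ATG", "GTG", "TTG"}


def kmer_tokenize(seq: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    The value of k is a hypothetical default, not taken from the abstract.
    Sequences shorter than k yield an empty list.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def candidate_tis(orf: str) -> List[int]:
    """Return 0-based offsets of in-frame start codons inside an ORF.

    The ORF is assumed to be given on the forward strand, in frame,
    so codons begin at offsets 0, 3, 6, ...
    A second-stage classifier would then score each candidate; that
    model is not shown here.
    """
    orf = orf.upper()
    return [
        i for i in range(0, len(orf) - 2, 3)
        if orf[i:i + 3] in START_CODONS
    ]


if __name__ == "__main__":
    print(kmer_tokenize("ATGCGT", k=3))
    print(candidate_tis("ATGAAAGTGCCCTAA"))
```

In a pipeline of this shape, the k-mer tokens would be fed to the fine-tuned model for CDS classification, and each offset returned by `candidate_tis` would define one candidate window for the TIS stage.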

Source journal

Briefings in bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20
Self-citation rate: 13.70%
Articles published: 549
Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.