Genomic language models with k-mer tokenization strategies for plant genome annotation and regulatory element strength prediction.

IF 3.8, Zone 2 (Biology), Q2 BIOCHEMISTRY & MOLECULAR BIOLOGY

Plant Molecular Biology Pub Date : 2025-07-31 DOI:10.1007/s11103-025-01604-7

Shosuke Suzuki, Kazumasa Horie, Toshiyuki Amagasa, Naoya Fukuda


Citations: 0

Abstract

Recent advances in genomic language models have improved the accuracy of in silico analyses, yet many rely on resource-intensive architectures. In this study, we focus on the impact of k-mer tokenization strategies, specifically varying window sizes (three to eight) and overlap schemes, on the performance of transformer-based genomic language models. Through extensive evaluation across multiple plant genomic tasks, including splice site and alternative polyadenylation site prediction, we show that thoughtful design of the k-mer tokenizer plays a critical role in model performance, often outweighing model scale. In particular, overlap-based tokenization generally enhances performance by preserving local sequence context, while certain non-overlap configurations achieve competitive accuracy with improved computational efficiency in some tasks. Despite using a smaller model, our approach performs on par with the state-of-the-art AgroNT model in many cases. These results emphasize that k-mer tokenization, not merely model size, is a key determinant of success in genomic sequence modeling. Our findings provide practical guidance for designing efficient genomic language models tailored to plant biology.
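
The difference between overlapping and non-overlapping k-mer tokenization described in the abstract can be illustrated with a minimal sketch. This is not the authors' tokenizer: the function names `kmer_tokenize` and `vocab_size`, the `stride` parameter, and the choice to drop a trailing remainder shorter than k are all assumptions made for illustration, since the abstract does not specify how remainders, ambiguous bases, or special tokens are handled.

```python
from typing import List


def kmer_tokenize(seq: str, k: int, stride: int = 1) -> List[str]:
    """Split a nucleotide sequence into k-mers (hypothetical helper).

    stride=1 gives fully overlapping windows (len(seq) - k + 1 tokens).
    stride=k gives non-overlapping windows (len(seq) // k tokens).
    Assumption: a trailing fragment shorter than k is dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    # Slide a window of width k across the sequence in steps of `stride`.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def vocab_size(k: int, alphabet: str = "ACGT") -> int:
    """Number of distinct k-mers over the alphabet (special tokens excluded)."""
    return len(alphabet) ** k


if __name__ == "__main__":
    seq = "ATGCGTACGT"
    print(kmer_tokenize(seq, k=3, stride=1))  # overlapping windows
    print(kmer_tokenize(seq, k=3, stride=3))  # non-overlapping windows
    print(vocab_size(3), vocab_size(8))       # vocabulary grows as 4**k
```

With stride 1, each base appears in up to k tokens, so local context is retained at the cost of a token sequence nearly as long as the input. With stride k, the token sequence is about k times shorter, which is one plausible source of the computational savings the abstract mentions for non-overlap configurations.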


Source journal

Plant Molecular Biology (Biology: Biochemistry & Molecular Biology)

Self-citation rate

2.00%

Articles published

Review time

1.4 months

About the journal: Plant Molecular Biology is an international journal dedicated to rapid publication of original research articles in all areas of plant biology. The Editorial Board welcomes full-length manuscripts that address important biological problems of broad interest, including research in comparative genomics, functional genomics, proteomics, bioinformatics, computational biology, biochemical and regulatory networks, and biotechnology. Because space in the journal is limited, however, preference is given to publication of results that provide significant new insights into biological problems and that advance the understanding of structure, function, mechanisms, or regulation. Authors must ensure that results are of high quality and that manuscripts are written for a broad plant science audience.