Genomic language models with k-mer tokenization strategies for plant genome annotation and regulatory element strength prediction.
Authors: Shosuke Suzuki, Kazumasa Horie, Toshiyuki Amagasa, Naoya Fukuda
Journal: Plant Molecular Biology 115(4): 100 (published 2025-07-31)
DOI: https://doi.org/10.1007/s11103-025-01604-7
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12313756/pdf/
Recent advances in genomic language models have improved the accuracy of in silico analyses, yet many rely on resource-intensive architectures. In this study, we focus on the impact of k-mer tokenization strategies, specifically varying window sizes (three to eight) and overlap schemes, on the performance of transformer-based genomic language models. Through extensive evaluation across multiple plant genomic tasks, including splice site and alternative polyadenylation site prediction, we show that thoughtful design of the k-mer tokenizer plays a critical role in model performance, often outweighing model scale. In particular, overlap-based tokenization generally enhances performance by preserving local sequence context, while certain non-overlap configurations achieve competitive accuracy with improved computational efficiency in some tasks. Despite using a smaller model, our approach performs on par with the state-of-the-art AgroNT model in many cases. These results emphasize that k-mer tokenization, not merely model size, is a key determinant of success in genomic sequence modeling. Our findings provide practical guidance for designing efficient genomic language models tailored to plant biology.
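The two tokenization schemes contrasted in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' tokenizer in detail, so the function name `kmer_tokenize`, the stride choices (1 for overlap, k for non-overlap), and the handling of a trailing remainder are all assumptions made here for illustration only.

```python
def kmer_tokenize(sequence: str, k: int, overlap: bool = True) -> list[str]:
    """Split a nucleotide sequence into k-mer tokens (hypothetical helper).

    overlap=True  uses a sliding window with stride 1.
    overlap=False uses disjoint windows with stride k; a trailing
    remainder shorter than k is dropped (an assumption of this sketch).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    stride = 1 if overlap else k
    # Only start positions that leave room for a full k-mer are used.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


seq = "ATGCGTAC"
overlapping = kmer_tokenize(seq, k=3, overlap=True)
non_overlapping = kmer_tokenize(seq, k=3, overlap=False)

print(overlapping)      # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(non_overlapping)  # ['ATG', 'CGT']
```

For this 8-base input, the overlapping scheme yields six tokens and the non-overlapping scheme two, which shows the trade-off the abstract points to: overlap keeps every local context but produces roughly k times more tokens for the model to process. Over the four-letter alphabet A, C, G, T, the number of possible k-mers is 4^k, so the window sizes three to eight correspond to 64 to 65,536 distinct k-mers before any special tokens.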
About the journal:
Plant Molecular Biology is an international journal dedicated to rapid publication of original research articles in all areas of plant biology. The Editorial Board welcomes full-length manuscripts that address important biological problems of broad interest, including research in comparative genomics, functional genomics, proteomics, bioinformatics, computational biology, biochemical and regulatory networks, and biotechnology. Because space in the journal is limited, however, preference is given to publication of results that provide significant new insights into biological problems and that advance the understanding of structure, function, mechanisms, or regulation. Authors must ensure that results are of high quality and that manuscripts are written for a broad plant science audience.