LeAnn M Lindsey, Nicole L Pershing, Anisa Habib, Keith Dufault-Thompson, W Zac Stephens, Anne J Blaschke, Xiaofang Jiang, Hari Sundar
{"title":"标记器选择对基因组语言模型的影响。","authors":"LeAnn M Lindsey, Nicole L Pershing, Anisa Habib, Keith Dufault-Thompson, W Zac Stephens, Anne J Blaschke, Xiaofang Jiang, Hari Sundar","doi":"10.1093/bioinformatics/btaf456","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Genomic language models have recently emerged as a new method to decode, interpret, and generate genetic sequences. Existing genomic language models have utilized various tokenization methods, including character tokenization, overlapping and nonoverlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic sequences differ from natural language because of their low character variability, complex and overlapping features, and inconsistent directionality. These features make subword tokenization in genomic language models significantly different from both traditional language models and protein language models.</p><p><strong>Results: </strong>This study explores the impact of tokenization in genomic language models by evaluating their downstream performance on 44 classification fine-tuning tasks. We also perform a direct comparison of byte pair encoding and character tokenization in Mamba, a state-space model. Our results indicate that character tokenization outperforms subword tokenization methods on tasks that rely on nucleotide-level resolution, such as splice site prediction and promoter detection. While byte-pair tokenization had stronger performance on the SARS-CoV-2 variant classification task, we observed limited statistically significant differences between tokenization methods on the remaining downstream tasks.</p><p><strong>Availability and implementation: </strong>Detailed results of all benchmarking experiments are available in https://github.com/leannmlindsey/DNAtokenization. Training datasets and pretrained models are available at https://huggingface.co/datasets/leannmlindsey. Datasets and processing scripts are available at doi: 10.5281/zenodo.16287401 and doi: 10.5281/zenodo.16287130.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453675/pdf/","citationCount":"0","resultStr":"{\"title\":\"The impact of tokenizer selection in genomic language models.\",\"authors\":\"LeAnn M Lindsey, Nicole L Pershing, Anisa Habib, Keith Dufault-Thompson, W Zac Stephens, Anne J Blaschke, Xiaofang Jiang, Hari Sundar\",\"doi\":\"10.1093/bioinformatics/btaf456\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Motivation: </strong>Genomic language models have recently emerged as a new method to decode, interpret, and generate genetic sequences. Existing genomic language models have utilized various tokenization methods, including character tokenization, overlapping and nonoverlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic sequences differ from natural language because of their low character variability, complex and overlapping features, and inconsistent directionality. 
These features make subword tokenization in genomic language models significantly different from both traditional language models and protein language models.</p><p><strong>Results: </strong>This study explores the impact of tokenization in genomic language models by evaluating their downstream performance on 44 classification fine-tuning tasks. We also perform a direct comparison of byte pair encoding and character tokenization in Mamba, a state-space model. Our results indicate that character tokenization outperforms subword tokenization methods on tasks that rely on nucleotide-level resolution, such as splice site prediction and promoter detection. While byte-pair tokenization had stronger performance on the SARS-CoV-2 variant classification task, we observed limited statistically significant differences between tokenization methods on the remaining downstream tasks.</p><p><strong>Availability and implementation: </strong>Detailed results of all benchmarking experiments are available in https://github.com/leannmlindsey/DNAtokenization. Training datasets and pretrained models are available at https://huggingface.co/datasets/leannmlindsey. Datasets and processing scripts are available at doi: 10.5281/zenodo.16287401 and doi: 10.5281/zenodo.16287130.</p>\",\"PeriodicalId\":93899,\"journal\":{\"name\":\"Bioinformatics (Oxford, England)\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":5.4000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453675/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics (Oxford, England)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioinformatics/btaf456\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btaf456","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The impact of tokenizer selection in genomic language models.
Motivation: Genomic language models have recently emerged as a new method to decode, interpret, and generate genetic sequences. Existing genomic language models have utilized various tokenization methods, including character tokenization, overlapping and nonoverlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic sequences differ from natural language because of their low character variability, complex and overlapping features, and inconsistent directionality. These features make subword tokenization in genomic language models significantly different from both traditional language models and protein language models.
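To make the contrast between these schemes concrete, the non-BPE tokenizers can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation; the example sequence and function names (`char_tokenize`, `kmer_tokenize`) are hypothetical.

```python
def char_tokenize(seq):
    """Character tokenization: one token per nucleotide."""
    return list(seq)

def kmer_tokenize(seq, k, overlapping=True):
    """k-mer tokenization: fixed-length substrings taken with
    stride 1 (overlapping) or stride k (nonoverlapping)."""
    stride = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

seq = "ACGTACGGTACG"                              # 12 nucleotides
print(char_tokenize(seq))                         # 12 tokens: ['A', 'C', 'G', 'T', ...]
print(kmer_tokenize(seq, 3))                      # 10 tokens: ['ACG', 'CGT', 'GTA', ...]
print(kmer_tokenize(seq, 3, overlapping=False))   # 4 tokens:  ['ACG', 'TAC', 'GGT', 'ACG']
```

The schemes trade sequence length against vocabulary size: character tokenization keeps a four-letter vocabulary but produces the longest token sequences, while k-mer tokenization shortens sequences (nonoverlapping) or densifies context (overlapping) at the cost of a vocabulary that grows as 4^k.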
Results: This study explores the impact of tokenization in genomic language models by evaluating their downstream performance on 44 classification fine-tuning tasks. We also perform a direct comparison of byte-pair encoding and character tokenization in Mamba, a state-space model. Our results indicate that character tokenization outperforms subword tokenization methods on tasks that rely on nucleotide-level resolution, such as splice site prediction and promoter detection. While byte-pair tokenization had stronger performance on the SARS-CoV-2 variant classification task, we observed few statistically significant differences between the tokenization methods on the remaining downstream tasks.
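Why nucleotide-level tasks favor character tokens can be seen with a toy byte-pair encoding, sketched below in pure Python (greedy most-frequent-pair merges; the training sequences and merge count are made up, and this is not the authors' trained tokenizer). BPE yields variable-length tokens whose boundaries need not align with a labeled position such as a splice site, whereas character tokens map one-to-one onto nucleotides.

```python
from collections import Counter

def merge_pair(symbols, a, b):
    """Replace each adjacent (a, b) pair in the symbol list with the merged token a+b."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(seqs, num_merges):
    """Learn BPE merges by repeatedly fusing the most frequent adjacent symbol pair."""
    corpus = [list(s) for s in seqs]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym in corpus:
            pairs.update(zip(sym, sym[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        corpus = [merge_pair(sym, a, b) for sym in corpus]
    return merges

def bpe_tokenize(seq, merges):
    """Apply the learned merges in training order to tokenize a new sequence."""
    symbols = list(seq)
    for a, b in merges:
        symbols = merge_pair(symbols, a, b)
    return symbols

merges = train_bpe(["ACGTACGTACGT", "ACGGACGGACGG"], num_merges=4)
print(bpe_tokenize("ACGTACGG", merges))  # ['ACGT', 'ACGG']: 2 tokens for 8 nucleotides
```

A per-nucleotide label (say, a feature at position 2) gets its own token under character tokenization but is buried inside the multi-nucleotide token 'ACGT' here, which is one plausible reason subword methods lag on tasks that need single-nucleotide resolution.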
Availability and implementation: Detailed results of all benchmarking experiments are available at https://github.com/leannmlindsey/DNAtokenization. Training datasets and pretrained models are available at https://huggingface.co/datasets/leannmlindsey. Datasets and processing scripts are available at doi: 10.5281/zenodo.16287401 and doi: 10.5281/zenodo.16287130.