The impact of tokenizer selection in genomic language models.

IF 5.4
LeAnn M Lindsey, Nicole L Pershing, Anisa Habib, Keith Dufault-Thompson, W Zac Stephens, Anne J Blaschke, Xiaofang Jiang, Hari Sundar
Bioinformatics (Oxford, England). Published 2025-09-01. doi: 10.1093/bioinformatics/btaf456. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453675/pdf/
Citations: 0

Abstract


Motivation: Genomic language models have recently emerged as a new method to decode, interpret, and generate genetic sequences. Existing genomic language models have utilized various tokenization methods, including character tokenization, overlapping and nonoverlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic sequences differ from natural language because of their low character variability, complex and overlapping features, and inconsistent directionality. These features make subword tokenization in genomic language models significantly different from both traditional language models and protein language models.
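The tokenization schemes compared in the paper can be illustrated on a short DNA string. The sketch below is not the authors' code; it shows character tokenization and overlapping vs nonoverlapping k-mer tokenization, the two rule-based schemes. Byte-pair encoding differs in that its vocabulary is learned by iteratively merging the most frequent adjacent token pairs in a corpus, so it cannot be reduced to a simple slicing rule.

```python
def char_tokenize(seq):
    """Character tokenization: one token per nucleotide."""
    return list(seq)

def kmer_tokenize(seq, k=3, overlapping=True):
    """k-mer tokenization: overlapping windows slide by 1, nonoverlapping by k."""
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

seq = "ATGCGTAC"
print(char_tokenize(seq))            # ['A', 'T', 'G', 'C', 'G', 'T', 'A', 'C']
print(kmer_tokenize(seq, 3, True))   # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(kmer_tokenize(seq, 4, False))  # ['ATGC', 'GTAC']
```

Note how character tokenization preserves single-nucleotide resolution, which is relevant to the result below that it performs best on tasks such as splice site prediction, where a one-base shift matters.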

Results: This study explores the impact of tokenization in genomic language models by evaluating their downstream performance on 44 classification fine-tuning tasks. We also perform a direct comparison of byte-pair encoding and character tokenization in Mamba, a state-space model. Our results indicate that character tokenization outperforms subword tokenization methods on tasks that rely on nucleotide-level resolution, such as splice site prediction and promoter detection. While byte-pair tokenization performed more strongly on the SARS-CoV-2 variant classification task, we observed few statistically significant differences between tokenization methods on the remaining downstream tasks.

Availability and implementation: Detailed results of all benchmarking experiments are available at https://github.com/leannmlindsey/DNAtokenization. Training datasets and pretrained models are available at https://huggingface.co/datasets/leannmlindsey. Datasets and processing scripts are available at doi: 10.5281/zenodo.16287401 and doi: 10.5281/zenodo.16287130. Supplementary data are available at Bioinformatics online.
