A Comparison of Tokenization Impact in Attention Based and State Space Genomic Language Models

LeAnn M Lindsey, Nicole L Pershing, Anisa Habib, W. Zac Stephens, Anne J Blaschke, Hari Sundar
{"title":"基于注意力和状态空间的基因组语言模型中标记化影响的比较","authors":"LeAnn M Lindsey, Nicole L Pershing, Anisa Habib, W. Zac Stephens, Anne J Blaschke, Hari Sundar","doi":"10.1101/2024.09.09.612081","DOIUrl":null,"url":null,"abstract":"Genomic language models have recently emerged as powerful tools to decode and interpret genetic sequences. Existing genomic language models have utilized various tokenization methods including character tokenization, overlapping and non-overlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic models have significant differences from natural language and protein language models because of their low character variability, complex and overlapping features, and inconsistent directionality. These differences make sub-word tokenization in genomic language models significantly different from traditional language models. This study explores the impact of tokenization in attention-based and state-space genomic language models by evaluating their downstream performance on various fine-tuning tasks. We propose new definitions for fertility, the token per word ratio, in the context of genomic language models, and introduce tokenization parity, which measures how consistently a tokenizer parses homologous sequences. We also perform an ablation study on the state-space model, Mamba, to evaluate the impact of character-based tokenization compared to byte-pair encoding. Our results indicate that the choice of tokenizer significantly impacts model performance and that when experiments control for input sequence length, character tokenization is the best choice in state-space models for all evaluated task categories except epigenetic mark prediction.","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Comparison of Tokenization Impact in Attention Based and State Space Genomic Language Models\",\"authors\":\"LeAnn M Lindsey, Nicole L Pershing, Anisa Habib, W. Zac Stephens, Anne J Blaschke, Hari Sundar\",\"doi\":\"10.1101/2024.09.09.612081\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Genomic language models have recently emerged as powerful tools to decode and interpret genetic sequences. Existing genomic language models have utilized various tokenization methods including character tokenization, overlapping and non-overlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic models have significant differences from natural language and protein language models because of their low character variability, complex and overlapping features, and inconsistent directionality. These differences make sub-word tokenization in genomic language models significantly different from traditional language models. This study explores the impact of tokenization in attention-based and state-space genomic language models by evaluating their downstream performance on various fine-tuning tasks. We propose new definitions for fertility, the token per word ratio, in the context of genomic language models, and introduce tokenization parity, which measures how consistently a tokenizer parses homologous sequences. We also perform an ablation study on the state-space model, Mamba, to evaluate the impact of character-based tokenization compared to byte-pair encoding. 
Our results indicate that the choice of tokenizer significantly impacts model performance and that when experiments control for input sequence length, character tokenization is the best choice in state-space models for all evaluated task categories except epigenetic mark prediction.\",\"PeriodicalId\":501307,\"journal\":{\"name\":\"bioRxiv - Bioinformatics\",\"volume\":\"24 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"bioRxiv - Bioinformatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2024.09.09.612081\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"bioRxiv - Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.09.09.612081","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Genomic language models have recently emerged as powerful tools to decode and interpret genetic sequences. Existing genomic language models have utilized various tokenization methods including character tokenization, overlapping and non-overlapping k-mer tokenization, and byte-pair encoding, a method widely used in natural language models. Genomic models have significant differences from natural language and protein language models because of their low character variability, complex and overlapping features, and inconsistent directionality. These differences make sub-word tokenization in genomic language models significantly different from traditional language models. This study explores the impact of tokenization in attention-based and state-space genomic language models by evaluating their downstream performance on various fine-tuning tasks. We propose new definitions for fertility, the token per word ratio, in the context of genomic language models, and introduce tokenization parity, which measures how consistently a tokenizer parses homologous sequences. We also perform an ablation study on the state-space model, Mamba, to evaluate the impact of character-based tokenization compared to byte-pair encoding. Our results indicate that the choice of tokenizer significantly impacts model performance and that when experiments control for input sequence length, character tokenization is the best choice in state-space models for all evaluated task categories except epigenetic mark prediction.
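To make the tokenization schemes and the fertility metric concrete, the following Python sketch (not from the paper) contrasts character tokenization with overlapping and non-overlapping k-mer tokenization on a toy DNA string. The choice of k, the example sequence, and the tokens-per-character reading of fertility are illustrative assumptions; byte-pair encoding is omitted since it requires training a vocabulary on a corpus.

```python
# Illustrative sketch of the tokenization schemes compared in the paper,
# applied to a toy DNA sequence, plus a simple fertility-style ratio
# (tokens emitted per input character). Not the authors' implementation.

def char_tokenize(seq: str) -> list[str]:
    """Character (single-nucleotide) tokenization: one token per base."""
    return list(seq)

def kmer_tokenize(seq: str, k: int, overlapping: bool) -> list[str]:
    """k-mer tokenization: stride 1 if overlapping, stride k otherwise."""
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

def fertility(tokens: list[str], seq: str) -> float:
    """Tokens per input character; lower values mean more compression."""
    return len(tokens) / len(seq)

seq = "ATGCGATACGCTTGA"  # hypothetical example sequence
for name, toks in [
    ("char", char_tokenize(seq)),
    ("6-mer overlapping", kmer_tokenize(seq, 6, overlapping=True)),
    ("6-mer non-overlapping", kmer_tokenize(seq, 6, overlapping=False)),
]:
    print(f"{name:22s} tokens={len(toks):3d} fertility={fertility(toks, seq):.2f}")
```

Character tokenization always yields a fertility of 1.0, while non-overlapping k-mers compress the sequence and overlapping k-mers expand the token count relative to non-overlapping ones; a learned byte-pair-encoding vocabulary would instead produce variable-length tokens whose fertility depends on the training corpus.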