Distinguishing word identity and sequence context in DNA language models

IF 2.9 · JCR Q2 (Biochemical Research Methods) · CAS Region 3 (Biology)
Melissa Sanabria, Jonas Hirsch, Anna R. Poetsch
BMC Bioinformatics, published 2024-09-13 · DOI: 10.1186/s12859-024-05869-5
Citations: 0

Abstract

Transformer-based large language models (LLMs) are well suited to biological sequence data because of analogies to natural language. Complex relationships can be learned because a concept of "words" can be generated through tokenization. When trained with masked token prediction, the models learn both token sequence identity and larger sequence context. We developed methodology to interrogate model learning, which is relevant both for the interpretability of the model and for evaluating its potential for specific tasks. We used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens. To gain insight into the model's learning, we interrogated how the model performs predictions, extracted token embeddings, and defined a fine-tuning benchmarking task to predict the next tokens of different sizes without overlaps. This task evaluates foundation models without interrogating specific genome biology; it does not depend on tokenization strategies, vocabulary size, the dictionary, or the number of training parameters. Lastly, there is no leakage of information from token identity into the prediction task, which makes it particularly useful for evaluating the learning of sequence context. We discovered that the model with overlapping k-mers struggles to learn larger sequence context. Instead, the learned embeddings largely represent token sequence. Still, good performance is achieved for genome-biology-inspired fine-tuning tasks. Models with overlapping tokens may be used for tasks where a larger sequence context is of less relevance, but the token sequence directly represents the desired learning features. This emphasizes the need to interrogate knowledge representation in biological LLMs.
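The two properties the abstract contrasts, identity leakage between overlapping k-mer tokens and the non-overlapping next-token benchmark, can be illustrated with a short sketch. This is not the authors' code: the example sequence, k-mer size, and helper names are illustrative assumptions; it only shows why a masked overlapping 6-mer is almost fully recoverable from its neighbouring tokens, whereas bases lying outside the tokenized context are not.

```python
# Minimal sketch (assumptions, not the authors' implementation): why overlapping
# k-mer tokens leak identity into masked-token prediction, and how predicting
# the next non-overlapping bases avoids that leakage.

def overlapping_kmers(seq: str, k: int = 6) -> list:
    """Stride-1 tokenization as in DNABERT: adjacent tokens share k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ACGTTGCAAGGTCATGC"   # toy sequence for illustration
k = 6
tokens = overlapping_kmers(seq, k)

# Mask one token: its two neighbours already spell out all k of its bases.
m = 3
left, right = tokens[m - 1], tokens[m + 1]
assert left[1:] == tokens[m][:-1]     # left neighbour reveals the first k-1 bases
assert right[:-1] == tokens[m][1:]    # right neighbour reveals the last k-1 bases
reconstructed = left[1:] + right[-2]  # masked 6-mer recovered without any wider context
assert reconstructed == tokens[m]

# Benchmark idea: from a tokenized context, predict the next n bases that lie
# outside every input token, so token identity alone cannot solve the task.
n_context = 8
context_tokens = tokens[:n_context]    # overlapping 6-mers given as input
uncovered = n_context - 1 + k          # first base index not covered by the context
target = seq[uncovered:uncovered + 2]  # e.g. the next 2 bases to predict
print(context_tokens, "->", target)
```

In the paper's framing, strong masked-token accuracy can therefore come from token identity alone, while the non-overlapping next-token benchmark isolates how much genuine sequence context the model has learned.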
Source journal
BMC Bioinformatics (Biology – Biochemical Research Methods)
CiteScore: 5.70 · Self-citation rate: 3.30% · Articles published: 506 · Average review time: 4.3 months
Journal description: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.