Should you marginalize over possible tokenizations?

Annual Meeting of the Association for Computational Linguistics | Pub Date: 2023-06-30 | DOI: 10.48550/arXiv.2306.17757

N. Chirkova, Germán Kruszewski, Jos Rozen, Marc Dymetman

Citations: 1

Abstract

Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
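The gap between the default score and the marginal can be illustrated on a toy case. The sketch below is not the paper's algorithm: it assumes a hypothetical four-token vocabulary (`VOCAB`) with unigram probabilities, whereas a real autoregressive LM conditions each token on its prefix. It also assumes a uniform proposal over the valid next tokens, a deliberately simple stand-in for the proposal the authors devise, and a greedy longest-match tokenizer as a stand-in for the default tokenization. All function names are illustrative. Because the string is short, the exact marginal can be enumerated and compared with the importance-sampling estimate.

```python
import math
import random

# Hypothetical toy vocabulary with unigram token probabilities (an assumption;
# a real autoregressive LM would condition each token on the preceding ones).
VOCAB = {"a": 0.3, "b": 0.3, "ab": 0.3, "abab": 0.1}


def seq_prob(tokens):
    """Probability the toy model assigns to one token sequence."""
    return math.prod(VOCAB[t] for t in tokens)


def next_tokens(s):
    """All vocabulary tokens that are a prefix of s."""
    return [s[:i] for i in range(1, len(s) + 1) if s[:i] in VOCAB]


def tokenizations(s):
    """Enumerate every token sequence that spells out s."""
    if not s:
        yield []
        return
    for head in next_tokens(s):
        for rest in tokenizations(s[len(head):]):
            yield [head] + rest


def greedy_tokenize(s):
    """Greedy longest-match tokenization, standing in for the default one."""
    tokens = []
    while s:
        head = max(next_tokens(s), key=len)
        tokens.append(head)
        s = s[len(head):]
    return tokens


def importance_estimate(s, n_samples, rng):
    """Estimate the marginal as the mean of p(T) / q(T) over sampled T.

    The proposal q picks uniformly among valid next tokens. Every single
    character of s is in VOCAB, so any remaining suffix can be completed.
    """
    total = 0.0
    for _ in range(n_samples):
        rest, tokens, q = s, [], 1.0
        while rest:
            choices = next_tokens(rest)
            head = rng.choice(choices)
            q *= 1.0 / len(choices)
            tokens.append(head)
            rest = rest[len(head):]
        total += seq_prob(tokens) / q
    return total / n_samples


text = "abab"
default_p = seq_prob(greedy_tokenize(text))                  # one tokenization
marginal_p = sum(seq_prob(t) for t in tokenizations(text))   # all tokenizations
estimate = importance_estimate(text, 20_000, random.Random(0))
print(default_p, marginal_p, estimate)
```

The toy vocabulary is chosen so that the alternative tokenizations carry substantial mass, which exaggerates the gap far beyond what the abstract reports for real models. Exhaustive enumeration only works here because the string is four characters long. For realistic strings the number of tokenizations grows exponentially, which is why a sampling-based estimate is needed.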


