{"title":"Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms","authors":"Yuchao Li, Dimitri Bertsekas","doi":"arxiv-2403.15465","DOIUrl":null,"url":null,"abstract":"In this paper we consider a transformer with an $n$-gram structure, such as\nthe one underlying ChatGPT. The transformer provides next word probabilities,\nwhich can be used to generate word sequences. We consider methods for computing\nword sequences that are highly likely, based on these probabilities. Computing\nthe optimal (i.e., most likely) word sequence starting with a given initial\nstate is an intractable problem, so we propose methods to compute highly likely\nsequences of $N$ words in time that is a low order polynomial in $N$ and in the\nvocabulary size of the $n$-gram. These methods are based on the rollout\napproach from approximate dynamic programming, a form of single policy\niteration, which can improve the performance of any given heuristic policy. In\nour case we use a greedy heuristic that generates as next word one that has the\nhighest probability. We show with analysis, examples, and computational\nexperimentation that our methods are capable of generating highly likely\nsequences with a modest increase in computation over the greedy heuristic.\nWhile our analysis and experiments are focused on Markov chains of the type\narising in transformer and ChatGPT-like models, our methods apply to general\nfinite-state Markov chains, and related inference applications of Hidden Markov\nModels (HMM), where Viterbi decoding is used extensively.","PeriodicalId":501062,"journal":{"name":"arXiv - CS - Systems and Control","volume":"2016 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Systems and Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2403.15465","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper we consider a transformer with an $n$-gram structure, such as the one underlying ChatGPT. The transformer provides next-word probabilities, which can be used to generate word sequences. We consider methods for computing word sequences that are highly likely, based on these probabilities. Computing the optimal (i.e., most likely) word sequence starting from a given initial state is an intractable problem, so we propose methods that compute highly likely sequences of $N$ words in time that is a low-order polynomial in $N$ and in the vocabulary size of the $n$-gram. These methods are based on the rollout approach from approximate dynamic programming, a form of single policy iteration, which can improve the performance of any given heuristic policy. In our case the base heuristic is greedy: it generates as the next word one with the highest probability. We show through analysis, examples, and computational experimentation that our methods are capable of generating highly likely sequences with a modest increase in computation over the greedy heuristic. While our analysis and experiments focus on Markov chains of the type arising in transformer and ChatGPT-like models, our methods apply to general finite-state Markov chains, and to related inference applications of Hidden Markov Models (HMMs), where Viterbi decoding is used extensively.
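To make the idea summarized above concrete, the sketch below shows a one-step rollout with a greedy base heuristic on a toy first-order Markov chain (a bigram model). This is not the authors' implementation: the transition matrix P, the function names, and the log-likelihood scoring are illustrative assumptions. At each position, every candidate next word is evaluated by completing the sequence with the greedy heuristic and scoring the full sequence; the best-scoring candidate is kept, which is the mechanism behind rollout's improvement over the greedy baseline.

import numpy as np

def greedy_sequence(P, start, N):
    """Base heuristic: from `start`, repeatedly pick the most probable next word
    under the transition matrix P, where P[i, j] = prob of word j following word i."""
    seq = [start]
    for _ in range(N - 1):
        seq.append(int(np.argmax(P[seq[-1]])))
    return seq

def log_prob(P, seq):
    """Log-likelihood of a word sequence under the Markov chain P
    (the given initial word is not scored)."""
    return sum(np.log(P[a, b]) for a, b in zip(seq, seq[1:]))

def rollout_sequence(P, start, N):
    """One-step rollout: at each position, try every candidate next word,
    complete the rest of the sequence with the greedy heuristic, and keep
    the candidate whose completed sequence is most likely."""
    seq = [start]
    for t in range(1, N):
        best_word, best_score = None, -np.inf
        for w in range(P.shape[1]):
            if P[seq[-1], w] == 0.0:
                continue                                  # skip impossible transitions
            tail = greedy_sequence(P, w, N - t)           # greedy completion starting at w
            score = log_prob(P, seq + tail)               # score the full N-word sequence
            if score > best_score:
                best_word, best_score = w, score
        seq.append(best_word)
    return seq

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((6, 6))
    P /= P.sum(axis=1, keepdims=True)                     # row-stochastic toy chain
    g = greedy_sequence(P, start=0, N=5)
    r = rollout_sequence(P, start=0, N=5)
    print("greedy :", g, log_prob(P, g))
    print("rollout:", r, log_prob(P, r))

In this toy setting the rollout sequence is never less likely than the greedy one, and the extra work per generated word is roughly one greedy completion per vocabulary entry, consistent with the low-order polynomial complexity in $N$ and the vocabulary size claimed in the abstract.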