{"title":"Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms","authors":"Yuchao Li, Dimitri Bertsekas","doi":"arxiv-2403.15465","DOIUrl":null,"url":null,"abstract":"In this paper we consider a transformer with an $n$-gram structure, such as\nthe one underlying ChatGPT. The transformer provides next word probabilities,\nwhich can be used to generate word sequences. We consider methods for computing\nword sequences that are highly likely, based on these probabilities. Computing\nthe optimal (i.e., most likely) word sequence starting with a given initial\nstate is an intractable problem, so we propose methods to compute highly likely\nsequences of $N$ words in time that is a low order polynomial in $N$ and in the\nvocabulary size of the $n$-gram. These methods are based on the rollout\napproach from approximate dynamic programming, a form of single policy\niteration, which can improve the performance of any given heuristic policy. In\nour case we use a greedy heuristic that generates as next word one that has the\nhighest probability. We show with analysis, examples, and computational\nexperimentation that our methods are capable of generating highly likely\nsequences with a modest increase in computation over the greedy heuristic.\nWhile our analysis and experiments are focused on Markov chains of the type\narising in transformer and ChatGPT-like models, our methods apply to general\nfinite-state Markov chains, and related inference applications of Hidden Markov\nModels (HMM), where Viterbi decoding is used extensively.","PeriodicalId":501062,"journal":{"name":"arXiv - CS - Systems and Control","volume":"2016 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Systems and Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2403.15465","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper we consider a transformer with an $n$-gram structure, such as the one underlying ChatGPT. The transformer provides next-word probabilities, which can be used to generate word sequences. We consider methods for computing word sequences that are highly likely, based on these probabilities. Computing the optimal (i.e., most likely) word sequence starting from a given initial state is an intractable problem, so we propose methods that compute highly likely sequences of $N$ words in time that is a low-order polynomial in $N$ and in the vocabulary size of the $n$-gram. These methods are based on the rollout approach from approximate dynamic programming, a form of single policy iteration, which can improve the performance of any given heuristic policy. In our case the base heuristic is greedy: it generates as the next word one with the highest probability. We show through analysis, examples, and computational experimentation that our methods are capable of generating highly likely sequences with a modest increase in computation over the greedy heuristic. While our analysis and experiments focus on Markov chains of the type arising in transformer and ChatGPT-like models, our methods apply to general finite-state Markov chains, and to related inference applications of Hidden Markov Models (HMMs), where Viterbi decoding is used extensively.
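To make the idea summarized above concrete, the sketch below shows a one-step rollout with a greedy base heuristic on a toy first-order Markov chain (a bigram model). This is not the authors' implementation: the transition matrix P, the function names, and the log-likelihood scoring are illustrative assumptions. At each position, every candidate next word is evaluated by completing the sequence with the greedy heuristic and scoring the full sequence; the best-scoring candidate is kept, which is the mechanism behind rollout's improvement over the greedy baseline.

import numpy as np

def greedy_sequence(P, start, N):
    """Base heuristic: from `start`, repeatedly pick the most probable next word
    under the transition matrix P, where P[i, j] = prob of word j following word i."""
    seq = [start]
    for _ in range(N - 1):
        seq.append(int(np.argmax(P[seq[-1]])))
    return seq

def log_prob(P, seq):
    """Log-likelihood of a word sequence under the Markov chain P
    (the given initial word is not scored)."""
    return sum(np.log(P[a, b]) for a, b in zip(seq, seq[1:]))

def rollout_sequence(P, start, N):
    """One-step rollout: at each position, try every candidate next word,
    complete the rest of the sequence with the greedy heuristic, and keep
    the candidate whose completed sequence is most likely."""
    seq = [start]
    for t in range(1, N):
        best_word, best_score = None, -np.inf
        for w in range(P.shape[1]):
            if P[seq[-1], w] == 0.0:
                continue                                  # skip impossible transitions
            tail = greedy_sequence(P, w, N - t)           # greedy completion starting at w
            score = log_prob(P, seq + tail)               # score the full N-word sequence
            if score > best_score:
                best_word, best_score = w, score
        seq.append(best_word)
    return seq

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((6, 6))
    P /= P.sum(axis=1, keepdims=True)                     # row-stochastic toy chain
    g = greedy_sequence(P, start=0, N=5)
    r = rollout_sequence(P, start=0, N=5)
    print("greedy :", g, log_prob(P, g))
    print("rollout:", r, log_prob(P, r))

In this toy setting the rollout sequence is never less likely than the greedy one, and the extra work per generated word is roughly one greedy completion per vocabulary entry, consistent with the low-order polynomial complexity in $N$ and the vocabulary size claimed in the abstract.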