{"title":"奖励增强解码:利用单向奖励模型高效生成受控文本","authors":"H. Deng, Colin Raffel","doi":"10.18653/v1/2023.emnlp-main.721","DOIUrl":null,"url":null,"abstract":"While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.","PeriodicalId":505350,"journal":{"name":"Conference on Empirical Methods in Natural Language Processing","volume":"4 1","pages":"11781-11791"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model\",\"authors\":\"H. Deng, Colin Raffel\",\"doi\":\"10.18653/v1/2023.emnlp-main.721\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. 
We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.\",\"PeriodicalId\":505350,\"journal\":{\"name\":\"Conference on Empirical Methods in Natural Language Processing\",\"volume\":\"4 1\",\"pages\":\"11781-11791\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-10-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Conference on Empirical Methods in Natural Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2023.emnlp-main.721\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Conference on Empirical Methods in Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2023.emnlp-main.721","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.
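To make the rescaling step concrete, the following is a minimal sketch of one reward-augmented sampling step in the spirit of the abstract: the base language model's top-k candidate tokens are rescored with a reward model, their logits are shifted toward high reward, and the next token is sampled from the renormalized distribution. The function names, the additive beta-times-reward adjustment, and the default values of k and beta are illustrative assumptions, not the paper's exact procedure or settings.

```python
import numpy as np

def rad_step(logits, candidate_reward, k=20, beta=5.0, rng=None):
    """One reward-augmented sampling step (illustrative sketch, not the paper's exact rule).

    logits:            unnormalized LM scores over the vocabulary, shape (V,)
    candidate_reward:  callable mapping a candidate token id to a scalar reward
                       (e.g. 1 - predicted toxicity); stands in for the paper's
                       small unidirectional reward model
    k, beta:           number of candidates to rescore and the reward weight
                       (hypothetical values)
    """
    rng = rng or np.random.default_rng()
    # Rescore only the top-k tokens under the base LM; scoring the full
    # vocabulary with the reward model every step would be wasteful.
    top_k = np.argpartition(logits, -k)[-k:]
    # Shift each candidate's logit by beta * reward, then renormalize and
    # sample, so high-reward tokens become proportionally more likely.
    rewards = np.array([candidate_reward(t) for t in top_k])
    adjusted = logits[top_k] + beta * rewards
    probs = np.exp(adjusted - adjusted.max())
    probs /= probs.sum()
    return int(rng.choice(top_k, p=probs))
```

In the full method described by the abstract, the reward model is unidirectional, so its activations for the already-generated prefix can be cached across decoding steps rather than recomputed for every candidate, which is what keeps the added overhead small.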