{"title":"Sample-Efficient Reinforcement Learning From Human Feedback via Information-Directed Sampling","authors":"Han Qi;Haochen Yang;Qiaosheng Zhang;Zhuoran Yang","doi":"10.1109/TIT.2025.3598296","DOIUrl":null,"url":null,"abstract":"We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term that encourages exploration of the unknown environment (which quantifies the information gained about the environment through observed human feedback data). To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified <italic>surrogate environment</i> and introduce a novel distance measure (named the <inline-formula> <tex-math>$\\ell _{g}$ </tex-math></inline-formula><italic>-distance</i>), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order <inline-formula> <tex-math>$O(H^{3/2}\\sqrt {\\log (K(\\epsilon)) T})$ </tex-math></inline-formula>, where <italic>H</i> is the episode length, <italic>T</i> is the number of episode and <inline-formula> <tex-math>$K(\\epsilon)$ </tex-math></inline-formula> is related to the covering number of the environment. Specializing to the tabular settings, this regret bound is of order <inline-formula> <tex-math>$\\tilde {O}(H^{2}\\sqrt {SAT})$ </tex-math></inline-formula>, where <italic>S</i> and <italic>A</i> are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.","PeriodicalId":13494,"journal":{"name":"IEEE Transactions on Information Theory","volume":"71 10","pages":"7942-7958"},"PeriodicalIF":2.9000,"publicationDate":"2025-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Information Theory","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11123904/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term that encourages exploration of the unknown environment, quantifying the information gained about the environment through observed human feedback data. To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified surrogate environment and introduce a novel distance measure (named the $\ell _{g}$ -distance), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order $O(H^{3/2}\sqrt {\log (K(\epsilon)) T})$ , where H is the episode length, T is the number of episodes and $K(\epsilon)$ is related to the covering number of the environment. Specializing to the tabular setting, this regret bound is of order $\tilde {O}(H^{2}\sqrt {SAT})$ , where S and A are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.
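The objective described in the abstract, a value term plus a mutual information term measuring what the human feedback reveals about the environment, can be illustrated with a small numerical sketch. The snippet below is only an illustration of that IDS-style trade-off, not the paper's algorithm: the finite posterior over environments, the binary preference model, and names such as `candidate values`, `pref_prob`, and `lambda_info` are assumptions made for the example.

```python
import numpy as np

# Illustrative IDS-style selection: among a finite set of candidate policies,
# pick the one maximizing
#     E[value under posterior] + lambda * I(environment ; binary feedback).
# All quantities below are synthetic and hypothetical.

rng = np.random.default_rng(0)

M = 4             # number of candidate environments in the posterior support
num_policies = 6  # number of candidate policies
lambda_info = 1.0 # weight on the information term (assumption for the example)

posterior = np.full(M, 1.0 / M)                               # posterior weights over environments
values = rng.uniform(0.0, 1.0, size=(num_policies, M))        # V(pi, m): value of policy pi in environment m
pref_prob = rng.uniform(0.05, 0.95, size=(num_policies, M))   # P(feedback = 1 | pi, m), a toy preference model

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def ids_objective(pi):
    # Expected value of policy pi under the current posterior.
    exp_value = posterior @ values[pi]
    # Mutual information between the environment index and the binary feedback
    # that policy pi would elicit: H(Y) - E_m[H(Y | m)].
    marginal = posterior @ pref_prob[pi]
    info_gain = binary_entropy(marginal) - posterior @ binary_entropy(pref_prob[pi])
    return exp_value + lambda_info * info_gain

best = max(range(num_policies), key=ids_objective)
print("selected policy:", best, "objective value:", ids_objective(best))
```

In this toy version, a policy is favored either because the posterior already believes it achieves high value or because the feedback it would generate distinguishes well between the candidate environments, which is the exploration incentive the abstract attributes to the mutual information term.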
Journal Introduction:
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.