Teaching Reinforcement Learning Agents via Reinforcement Learning

Kun Yang, Chengshuai Shi, Cong Shen
{"title":"Teaching Reinforcement Learning Agents via Reinforcement Learning","authors":"Kun Yang, Chengshuai Shi, Cong Shen","doi":"10.1109/CISS56502.2023.10089695","DOIUrl":null,"url":null,"abstract":"In many real-world reinforcement learning (RL) tasks, the agent who takes the actions often only has partial observations of the environment. On the other hand, a principal may have a complete, system-level view but cannot directly take actions to interact with the environment. Motivated by this agent-principal capability mismatch, we study a novel “teaching” problem where the principal attempts to guide the agent's behavior via implicit adjustment on her observed rewards. Rather than solving specific instances of this problem, we develop a general RL framework for the principal to teach any RL agent without knowing the optimal action a priori. The key idea is to view the agent as part of the environment, and to directly set the reward adjustment as actions such that efficient learning and teaching can be simultaneously accomplished at the principal. This framework is fully adaptive to diverse principal and agent settings (such as heterogeneous agent strategies and adjustment costs), and can adopt a variety of RL algorithms to solve the teaching problem with provable performance guarantees. Extensive experimental results on different RL tasks demonstrate that the proposed framework guarantees a stable convergence and achieves the best tradeoff between rewards and costs among various baseline solutions.","PeriodicalId":243775,"journal":{"name":"2023 57th Annual Conference on Information Sciences and Systems (CISS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 57th Annual Conference on Information Sciences and Systems (CISS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISS56502.2023.10089695","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In many real-world reinforcement learning (RL) tasks, the agent that takes actions often has only a partial observation of the environment. A principal, on the other hand, may have a complete, system-level view but cannot directly take actions to interact with the environment. Motivated by this agent-principal capability mismatch, we study a novel "teaching" problem in which the principal attempts to guide the agent's behavior via implicit adjustments to her observed rewards. Rather than solving specific instances of this problem, we develop a general RL framework that lets the principal teach any RL agent without knowing the optimal action a priori. The key idea is to view the agent as part of the environment and to treat the reward adjustments directly as the principal's actions, so that efficient learning and teaching are accomplished simultaneously at the principal. This framework adapts fully to diverse principal and agent settings (such as heterogeneous agent strategies and adjustment costs) and can adopt a variety of RL algorithms to solve the teaching problem with provable performance guarantees. Extensive experiments on different RL tasks demonstrate that the proposed framework achieves stable convergence and the best reward-cost tradeoff among various baseline solutions.
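The following is a minimal, self-contained sketch of this key idea, not the authors' implementation: a tabular Q-learning agent with an aliased (partial) view of a toy chain task is folded into a wrapper environment, and the principal, itself another Q-learner with the full state view, selects per-step reward adjustments as its actions. All names (ToyChainEnv, PrincipalEnv, TabularQLearner), the discrete adjustment set, and the linear adjustment cost are illustrative assumptions.

```python
import numpy as np

class ToyChainEnv:
    """A toy 5-state chain task. The acting agent only observes the state's
    parity (a partial view); the principal observes the exact state (the
    complete, system-level view described in the abstract)."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def observe_partial(self):
        return self.state % 2                     # agent: two aliased observations

    def observe_full(self):
        return self.state                         # principal: full state

    def apply(self, action):                      # action 0 = left, 1 = right
        step = 1 if action == 1 else -1
        self.state = max(0, min(self.n_states - 1, self.state + step))
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return reward, self.observe_partial()

class TabularQLearner:
    """Plain epsilon-greedy tabular Q-learning; reused for both the agent
    and, over the wrapped environment, the principal."""
    def __init__(self, n_obs, n_actions, lr=0.1, gamma=0.9, eps=0.1):
        self.q = np.zeros((n_obs, n_actions))
        self.lr, self.gamma, self.eps = lr, gamma, eps

    def act(self, obs):
        if np.random.rand() < self.eps:
            return np.random.randint(self.q.shape[1])
        return int(np.argmax(self.q[obs]))

    def update(self, obs, action, reward, next_obs):
        target = reward + self.gamma * self.q[next_obs].max()
        self.q[obs, action] += self.lr * (target - self.q[obs, action])

class PrincipalEnv:
    """The abstract's key idea: the agent is folded into the principal's
    environment. One principal 'step' chooses a reward adjustment, the agent
    acts and learns on the adjusted reward, and the principal pays an
    adjustment cost."""
    def __init__(self, base_env, agent, deltas, cost_weight=0.1):
        self.base_env, self.agent, self.deltas = base_env, agent, deltas
        self.cost_weight = cost_weight

    def step(self, delta_idx):
        delta = self.deltas[delta_idx]
        obs = self.base_env.observe_partial()
        action = self.agent.act(obs)
        reward, next_obs = self.base_env.apply(action)
        # The agent never sees delta itself, only the adjusted reward.
        self.agent.update(obs, action, reward + delta, next_obs)
        # The principal trades the true reward off against the adjustment cost.
        return self.base_env.observe_full(), reward - self.cost_weight * abs(delta)

# The principal is itself an RL learner over the wrapped environment.
env = ToyChainEnv()
agent = TabularQLearner(n_obs=2, n_actions=2)
deltas = [-0.5, 0.0, 0.5]                          # illustrative adjustment set
wrapped = PrincipalEnv(env, agent, deltas)
principal = TabularQLearner(n_obs=env.n_states, n_actions=len(deltas))

full_obs = env.observe_full()
for _ in range(5000):
    d = principal.act(full_obs)
    next_full_obs, p_reward = wrapped.step(d)
    principal.update(full_obs, d, p_reward, next_full_obs)
    full_obs = next_full_obs
```

Because the wrapped environment's dynamics include the agent's learning update, any standard RL algorithm could in principle replace the tabular principal here, which reflects the generality the abstract claims.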