Multi-trainer binary feedback interactive reinforcement learning

Impact Factor 1.0 · CAS Tier 4 (Computer Science) · JCR Q4 (Computer Science, Artificial Intelligence)
Zhaori Guo, Timothy J. Norman, Enrico H. Gerding
{"title":"Multi-trainer binary feedback interactive reinforcement learning","authors":"Zhaori Guo,&nbsp;Timothy J. Norman,&nbsp;Enrico H. Gerding","doi":"10.1007/s10472-024-09956-4","DOIUrl":null,"url":null,"abstract":"<div><p>Interactive reinforcement learning is an effective way to train agents via human feedback. However, it often requires the <i>trainer</i> (a human who provides feedback to the agent) to know the correct action for the agent. If the trainer is not always reliable, the wrong feedback may hinder the agent’s training. In addition, there is no consensus on the best form of human feedback in interactive reinforcement learning. To address these problems, in this paper, we explore the performance of binary reward as the reward form. Moreover, we propose a novel interactive reinforcement learning system called Multi-Trainer Interactive Reinforcement Learning (MTIRL), which can aggregate binary feedback from multiple imperfect trainers into a reliable reward for agent training in a reward-sparse environment. In addition, the review model in MTIRL can correct the unreliable rewards. In particular, our experiments for evaluating reward forms show that binary reward outperforms other reward forms, including ranking reward, scaling reward, and state value reward. In addition, our question-answer experiments show that our aggregation method outperforms the state-of-the-art aggregation methods, including majority voting, weighted voting, and the Bayesian aggregation method. Finally, we conduct grid-world experiments to show that the policy trained by the MTIRL with the review model is closer to the optimal policy than that without a review model.</p></div>","PeriodicalId":7971,"journal":{"name":"Annals of Mathematics and Artificial Intelligence","volume":"93 4","pages":"491 - 516"},"PeriodicalIF":1.0000,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of Mathematics and Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10472-024-09956-4","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Interactive reinforcement learning is an effective way to train agents via human feedback. However, it often requires the trainer (a human who provides feedback to the agent) to know the correct action for the agent. If the trainer is unreliable, incorrect feedback may hinder the agent’s training. Furthermore, there is no consensus on the best form of human feedback in interactive reinforcement learning. To address these problems, in this paper we examine the performance of binary reward as the feedback form. We also propose a novel interactive reinforcement learning system called Multi-Trainer Interactive Reinforcement Learning (MTIRL), which aggregates binary feedback from multiple imperfect trainers into a reliable reward for agent training in reward-sparse environments, and includes a review model that corrects unreliable rewards. Our experiments on reward forms show that binary reward outperforms other forms, including ranking reward, scaling reward, and state-value reward. Our question-answer experiments show that our aggregation method outperforms state-of-the-art aggregation methods, including majority voting, weighted voting, and Bayesian aggregation. Finally, grid-world experiments show that the policy trained by MTIRL with the review model is closer to the optimal policy than the one trained without it.
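
The paper does not include code, but to make the setting concrete, here is a minimal sketch of one way binary feedback (+1 / -1) from several imperfect trainers could be aggregated into a single reward: a reliability-weighted vote. This corresponds to one of the baselines the abstract compares against, not to MTIRL's own aggregation or review model; the function name, the weighting rule, and the example reliability values are illustrative assumptions.

```python
def aggregate_binary_feedback(feedback, reliability):
    """Combine binary feedback (+1 / -1) from several trainers into one reward.

    feedback:    dict mapping trainer id -> +1 or -1 for the current state-action pair
    reliability: dict mapping trainer id -> estimated probability the trainer is correct

    Illustrative reliability-weighted vote (an assumption for this sketch); it is
    NOT the aggregation rule proposed in the paper, which is reported to outperform
    majority voting, weighted voting, and Bayesian aggregation.
    """
    score = 0.0
    for trainer, vote in feedback.items():
        # Weight each vote by how much better than chance (0.5) the trainer is.
        weight = reliability.get(trainer, 0.5) - 0.5
        score += weight * vote
    # Collapse the weighted score back into a binary reward for the agent.
    return 1 if score >= 0 else -1


# Hypothetical usage: three imperfect trainers judge the same state-action pair.
feedback = {"trainer_a": 1, "trainer_b": -1, "trainer_c": 1}
reliability = {"trainer_a": 0.9, "trainer_b": 0.55, "trainer_c": 0.7}
print(aggregate_binary_feedback(feedback, reliability))  # -> 1
```

Weighting by (reliability - 0.5) means a trainer who is right only half the time contributes nothing, while a consistently wrong trainer effectively votes against their own feedback. Per the abstract, MTIRL goes further: its review model also corrects rewards that turn out to be unreliable.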

Source journal
Annals of Mathematics and Artificial Intelligence (Engineering & Technology / Computer Science: Artificial Intelligence)
CiteScore: 3.00
Self-citation rate: 8.30%
Articles per year: 37
Review time: >12 weeks
Journal description: Annals of Mathematics and Artificial Intelligence presents a range of topics of concern to scholars applying quantitative, combinatorial, logical, algebraic and algorithmic methods to diverse areas of Artificial Intelligence, from decision support, automated deduction, and reasoning, to knowledge-based systems, machine learning, computer vision, robotics and planning. The journal features collections of papers appearing either in volumes (400 pages) or in separate issues (100-300 pages), which focus on one topic and have one or more guest editors. Annals of Mathematics and Artificial Intelligence hopes to influence the spawning of new areas of applied mathematics and strengthen the scientific underpinnings of Artificial Intelligence.