An Efficient Policy Improvement in Human Interactive Learning Using Entropy

2021 International Conference on Information and Communication Technology Convergence (ICTC). Pub Date: 2021-10-20. DOI: 10.1109/ICTC52510.2021.9620856

Sungyun Park, Dae-Wook Kim, Sang-Kwang Lee, Seong-il Yang



Abstract

Human knowledge can be used in reinforcement learning (RL) to reduce the time a learning agent takes to achieve its goal. The TAMER (Training an Agent Manually via Evaluative Reinforcement) algorithm allows a human to provide rewards to an autonomous agent through a manual interface while watching the agent perform actions. Because the agent's policy is updated based on human rewards, it approximates how the human trainer gives rewards to the agent. For each policy update, events that occurred during learning are selected, and the selection considers the temporal distance from each event to the human reward: only events that occurred within a certain time interval before the human trainer gives a reward are selected. However, considering only the time factor demands a large number of human rewards to improve the policy, and this high-complexity policy update exhausts the human trainer during policy improvement. Therefore, we propose a new event selection method that considers, in addition to the time factor, the entropy of the distribution over Q-values. In the proposed method, events are reused for the policy update despite a long temporal distance from the human reward when their human reward is negative and their entropy (over the distribution of Q-values) is low. To compare the effectiveness of the proposed method with classic TAMER, we conduct an experiment in which the policy is initialized with incorrect weights. The results show that the TAMER algorithm, using our proposed event selection, improves the policy efficiently.
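The selection rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how Q-values are turned into a distribution, so a softmax is assumed here, and the `Event` record, the `window` length, and the `entropy_threshold` value are hypothetical names and numbers chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def softmax(q_values: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn Q-values into a probability distribution (assumed, not from the paper)."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def q_entropy(q_values: Sequence[float], temperature: float = 1.0) -> float:
    """Shannon entropy (in nats) of the softmax distribution over Q-values."""
    probs = softmax(q_values, temperature)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


@dataclass
class Event:
    """Hypothetical record of one step observed during learning."""
    time: float            # when the event occurred
    q_values: List[float]  # Q-values over actions at that step
    human_reward: float    # human reward associated with the event (0.0 if none)


def select_events(events: Sequence[Event], reward_time: float,
                  window: float = 0.8,
                  entropy_threshold: float = 0.5) -> List[Event]:
    """Select events for a policy update.

    An event is selected if it lies within the time window before the reward
    (the time-only criterion), or if it is older but its human reward is
    negative and its Q-value entropy is low (the entropy-based reuse).
    """
    selected = []
    for ev in events:
        age = reward_time - ev.time
        if age < 0:
            continue  # event happened after the reward
        in_window = age <= window
        reusable = (ev.human_reward < 0
                    and q_entropy(ev.q_values) < entropy_threshold)
        if in_window or reusable:
            selected.append(ev)
    return selected


events = [
    Event(time=9.5, q_values=[0.0, 0.0, 0.0], human_reward=0.0),   # recent
    Event(time=2.0, q_values=[5.0, 0.0, 0.0], human_reward=-1.0),  # old, confident, punished
    Event(time=3.0, q_values=[1.0, 1.0, 1.0], human_reward=-1.0),  # old, uncertain
    Event(time=4.0, q_values=[5.0, 0.0, 0.0], human_reward=1.0),   # old, rewarded
]
chosen = select_events(events, reward_time=10.0)
print([ev.time for ev in chosen])
```

Under these assumptions, low entropy means the policy strongly prefers one action, so an old event that was punished while the policy was confident is a plausible candidate for reuse; uniform Q-values give the maximum entropy, log of the number of actions, and are skipped.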

