An Efficient Policy Improvement in Human Interactive Learning Using Entropy
Sungyun Park, Dae-Wook Kim, Sang-Kwang Lee, Seong-il Yang
2021 International Conference on Information and Communication Technology Convergence (ICTC), 2021-10-20
DOI: 10.1109/ICTC52510.2021.9620856
Abstract
Human knowledge is used in reinforcement learning (RL) to reduce the time the learning agent takes to achieve its goal. The TAMER (Training an Agent Manually via Evaluative Reinforcement) algorithm allows a human to provide a reward to an autonomous agent through a manual interface while watching the agent perform the action. Because the agent's policy is updated based on human rewards, it approximates how the human trainer gives rewards to the agent. For the policy update, events that occurred during learning are selected. While selecting events, the temporal distance from each event to the human reward is considered; thus, only events that occurred within a certain time interval before the human trainer gives a reward are selected. However, this approach, which considers only the time factor, demands a large number of human rewards to improve the policy, and the resulting high-effort update process exhausts the human trainer. Therefore, we propose a new event selection method that considers the entropy over the distribution of Q-values in addition to the time factor. In the proposed method, events with a long temporal distance from the human reward are reused for the policy update when the associated human reward is negative and the entropy over the distribution of Q-values is low. To compare the effectiveness of the proposed method with classic TAMER, we conduct an experiment with the policy initialized to incorrect weights. The results show that the TAMER algorithm, using our proposed event selection, improves the policy efficiently.
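The paper's abstract does not include pseudocode, so the following is a minimal sketch of the described selection rule under our own assumptions: events are recorded as (timestamp, Q-values, features) tuples, entropy is computed over a softmax of the Q-values, and the names `q_entropy`, `select_events`, `time_window`, and `entropy_threshold` are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def q_entropy(q_values, temperature=1.0):
    """Shannon entropy of the softmax distribution over Q-values.

    Low entropy means one action clearly dominates (the agent was
    confident); high entropy means the Q-values are nearly uniform.
    """
    q = np.asarray(q_values, dtype=float) / temperature
    q -= q.max()                        # numerical stability
    p = np.exp(q) / np.exp(q).sum()     # softmax over actions
    return -np.sum(p * np.log(p + 1e-12))

def select_events(events, reward_time, reward_value,
                  time_window=2.0, entropy_threshold=0.5):
    """Pick events for updating the human-reward model.

    Events inside the time window before the human reward are kept,
    as in classic TAMER's temporal credit assignment. Older events
    are reused only when the reward is negative and the entropy over
    their Q-value distribution is low (a confident but punished choice).
    """
    selected = []
    for timestamp, q_values, features in events:
        temporal_distance = reward_time - timestamp
        if temporal_distance < 0:
            continue                    # event happened after the reward
        in_window = temporal_distance <= time_window
        confident_mistake = (reward_value < 0 and
                             q_entropy(q_values) < entropy_threshold)
        if in_window or confident_mistake:
            selected.append((timestamp, q_values, features))
    return selected
```

The threshold values above are placeholders; the paper's contribution is the criterion itself, i.e., reusing temporally distant events only when a negative human reward coincides with low Q-value entropy.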