{"title":"Selective imitation for efficient online reinforcement learning with pre-collected data","authors":"Chanin Eom , Dongsu Lee , Minhae Kwon","doi":"10.1016/j.icte.2024.09.001","DOIUrl":null,"url":null,"abstract":"<div><div>Deep reinforcement learning (RL) has emerged as a promising solution for autonomous devices requiring sequential decision-making. In the online RL framework, the agent must interact with the environment to collect data, making sample efficiency the most challenging aspect. While the off-policy method in online RL partially addresses this issue by employing a replay buffer, learning speed remains slow, particularly at the beginning of training, due to the low quality of data collected with the initial policy. To overcome this challenge, we propose Reward-Adaptive Pre-collected Data RL (RAPD-RL), which leverages pre-collected data in addition to online RL. We employ two buffers: one for pre-collected data and another for online collected data. The policy is trained using both buffers to increase the <span><math><mi>Q</mi></math></span> objective and imitate the actions in the dataset. To maintain resistance to poor-quality (i.e., low-reward) data, our method selectively imitates data based on reward information, thereby enhancing sample efficiency and learning speed. Simulation results demonstrate that the proposed solution converges rapidly and achieves high performance across various dataset qualities.</div></div>","PeriodicalId":48526,"journal":{"name":"ICT Express","volume":"10 6","pages":"Pages 1308-1314"},"PeriodicalIF":4.1000,"publicationDate":"2024-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICT Express","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2405959524001048","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Abstract
Deep reinforcement learning (RL) has emerged as a promising solution for autonomous devices requiring sequential decision-making. In the online RL framework, the agent must interact with the environment to collect data, making sample efficiency the most challenging aspect. While the off-policy method in online RL partially addresses this issue by employing a replay buffer, learning speed remains slow, particularly at the beginning of training, due to the low quality of data collected with the initial policy. To overcome this challenge, we propose Reward-Adaptive Pre-collected Data RL (RAPD-RL), which leverages pre-collected data in addition to online RL. We employ two buffers: one for pre-collected data and another for online collected data. The policy is trained using both buffers to increase the Q objective and imitate the actions in the dataset. To maintain resistance to poor-quality (i.e., low-reward) data, our method selectively imitates data based on reward information, thereby enhancing sample efficiency and learning speed. Simulation results demonstrate that the proposed solution converges rapidly and achieves high performance across various dataset qualities.
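To make the idea concrete, the sketch below illustrates one possible form of such an update: the actor is trained on a batch mixed from a pre-collected buffer and an online buffer, maximizing the critic's Q-value on all samples while imitating dataset actions only for transitions whose reward exceeds a threshold. This is a minimal sketch under our own assumptions (network sizes, the buffer layout, and the names reward_threshold and bc_weight are illustrative), not the authors' released implementation.

```python
# Minimal sketch of a reward-selective imitation update (illustrative only).
import torch
import torch.nn as nn


class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)


class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


def actor_update(actor, critic, actor_opt, offline_batch, online_batch,
                 reward_threshold, bc_weight=1.0):
    """One actor step on a batch drawn from both buffers (assumed layout:
    dicts with 'obs', 'act', 'rew' tensors)."""
    obs = torch.cat([offline_batch["obs"], online_batch["obs"]])
    act = torch.cat([offline_batch["act"], online_batch["act"]])
    rew = torch.cat([offline_batch["rew"], online_batch["rew"]])

    # Q-maximization term over the whole mixed batch.
    q_loss = -critic(obs, actor(obs)).mean()

    # Imitation term applied only to high-reward transitions.
    mask = (rew >= reward_threshold).float().unsqueeze(-1)
    bc_loss = (mask * (actor(obs) - act) ** 2).sum() / mask.sum().clamp(min=1.0)

    loss = q_loss + bc_weight * bc_loss
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```

In practice, the threshold could be set, for example, to a reward quantile of the pre-collected dataset, so that low-reward data still contributes to the Q term but is excluded from imitation.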
About the journal:
The ICT Express journal, published by the Korean Institute of Communications and Information Sciences (KICS), is an international, peer-reviewed research publication covering all aspects of information and communication technology. The journal aims to publish research that advances the theoretical and practical understanding of ICT convergence, platform technologies, communication networks, and device technologies. Advances in the information and communication technology (ICT) sector enable portable devices to remain always connected while supporting high data rates, a trend reflected in the popularity of smartphones, which have had a considerable impact on economic and social development.