Offline Imitation Learning Using Reward-free Exploratory Data

Hao Wang, Dawei Feng, Bo Ding, W. Li
{"title":"使用无奖励探索性数据的离线模仿学习","authors":"Hao Wang, Dawei Feng, Bo Ding, W. Li","doi":"10.1145/3579654.3579753","DOIUrl":null,"url":null,"abstract":"Offline imitative learning(OIL) is often used to solve complex continuous decision-making tasks. For these tasks such as robot control, automatic driving and etc., it is either difficult to design an effective reward for learning or very expensive and time-consuming for agents to collect data interactively with the environment. However, the data used in previous OIL methods are all gathered by reinforcement learning algorithms guided by task-specific rewards, which is not a true reward-free premise and still suffers from the problem of designing an effective reward function in real tasks. To this end, we propose the reward-free exploratory data driven offline imitation learning (ExDOIL) framework. ExDOIL first trains an unsupervised reinforcement learning agent by interacting with the environment, and collects enough unsupervised exploration data during training; Then, a task independent yet simple and efficient reward function is used to relabel the collected data; Finally, an agent is trained to imitate the expert to complete the task through a conventional RL algorithm such as TD3. Extensive experiments on continuous control tasks demonstrate that the proposed framework can achieve better imitation performance(28% higher episode returns on average) comparing with previous SOTA method(ORIL) without any task-specific rewards.","PeriodicalId":146783,"journal":{"name":"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Offline Imitation Learning Using Reward-free Exploratory Data\",\"authors\":\"Hao Wang, Dawei Feng, Bo Ding, W. Li\",\"doi\":\"10.1145/3579654.3579753\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Offline imitative learning(OIL) is often used to solve complex continuous decision-making tasks. For these tasks such as robot control, automatic driving and etc., it is either difficult to design an effective reward for learning or very expensive and time-consuming for agents to collect data interactively with the environment. However, the data used in previous OIL methods are all gathered by reinforcement learning algorithms guided by task-specific rewards, which is not a true reward-free premise and still suffers from the problem of designing an effective reward function in real tasks. To this end, we propose the reward-free exploratory data driven offline imitation learning (ExDOIL) framework. ExDOIL first trains an unsupervised reinforcement learning agent by interacting with the environment, and collects enough unsupervised exploration data during training; Then, a task independent yet simple and efficient reward function is used to relabel the collected data; Finally, an agent is trained to imitate the expert to complete the task through a conventional RL algorithm such as TD3. 
Extensive experiments on continuous control tasks demonstrate that the proposed framework can achieve better imitation performance(28% higher episode returns on average) comparing with previous SOTA method(ORIL) without any task-specific rewards.\",\"PeriodicalId\":146783,\"journal\":{\"name\":\"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence\",\"volume\":\"24 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3579654.3579753\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3579654.3579753","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Offline imitation learning (OIL) is often used to solve complex continuous decision-making tasks. For tasks such as robot control and autonomous driving, it is either difficult to design an effective reward for learning, or prohibitively expensive and time-consuming for agents to collect data by interacting with the environment. However, the data used in previous OIL methods are gathered by reinforcement learning algorithms guided by task-specific rewards, which is not a truly reward-free premise and still suffers from the difficulty of designing an effective reward function for real tasks. To this end, we propose the reward-free exploratory-data-driven offline imitation learning (ExDOIL) framework. ExDOIL first trains an unsupervised reinforcement learning agent through interaction with the environment and collects sufficient unsupervised exploration data during training. A task-independent yet simple and efficient reward function is then used to relabel the collected data. Finally, an agent is trained to imitate the expert and complete the task with a conventional RL algorithm such as TD3. Extensive experiments on continuous control tasks demonstrate that the proposed framework achieves better imitation performance (28% higher episode returns on average) than the previous state-of-the-art method (ORIL), without using any task-specific rewards.
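The abstract outlines a three-stage pipeline: collect reward-free exploratory data, relabel it with a task-independent reward, and run a conventional RL algorithm such as TD3 on the relabeled dataset. The paper's actual reward function is not reproduced here; as a purely illustrative assumption, the minimal sketch below relabels each exploratory transition with the negative Euclidean distance from its next state to the nearest expert state, and the function name and shapes are hypothetical.

```python
# Minimal sketch of the relabel-then-offline-RL data path described in the abstract.
# Assumption: a proximity-to-expert-states reward stands in for the paper's
# unspecified task-independent reward function.
import numpy as np


def relabel_with_expert_proximity(next_states: np.ndarray,
                                  expert_states: np.ndarray) -> np.ndarray:
    """Hypothetical task-independent reward: closeness to any expert state."""
    # Pairwise Euclidean distances between the batch of exploratory next states
    # and the set of expert states: shape (batch, num_expert_states).
    diffs = next_states[:, None, :] - expert_states[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Higher reward the closer the agent's state is to expert behaviour.
    return -dists.min(axis=1)


# Toy usage: relabel a batch of reward-free exploratory transitions.
rng = np.random.default_rng(0)
exploration_next_states = rng.normal(size=(128, 17))   # e.g. MuJoCo-like state vectors
expert_states = rng.normal(size=(1000, 17))            # states taken from expert demonstrations
rewards = relabel_with_expert_proximity(exploration_next_states, expert_states)
print(rewards.shape, float(rewards.mean()))
```

The relabeled tuples (s, a, r, s') would then be handed to an off-the-shelf offline RL learner such as TD3, as the abstract describes, so that imitation reduces to standard value-based training on the relabeled exploratory dataset.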