{"title":"MaxEnt做梦者:世界模型的最大熵强化学习","authors":"Hongying Ma, Wuyang Xue, R. Ying, Peilin Liu","doi":"10.1109/IJCNN55064.2022.9892381","DOIUrl":null,"url":null,"abstract":"Model-based reinforcement learning algorithms can alleviate the low sample efficiency problem compared with modelfree methods for control tasks. However, the learned policy's performance often lags behind the best model-free algorithms since its weak exploration ability. Existing model-based reinforcement learning algorithms learn policy by interacting with the learned world model and then use the learned policy to guide a new round of world model learning. Due to weak policy exploration ability, the learned world model has a large bias. As a result, it fails to learn the globally optimal policy on such a world model. This paper improves the learned world model by maximizing both the reward and the corresponding policy entropy in the framework of maximum entropy reinforcement learning. The effectiveness of applying the maximum entropy approach to model-based reinforcement learning is supported by the better performance of our algorithm on several complex mujoco and deepmind control suite tasks.","PeriodicalId":106974,"journal":{"name":"2022 International Joint Conference on Neural Networks (IJCNN)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MaxEnt Dreamer: Maximum Entropy Reinforcement Learning with World Model\",\"authors\":\"Hongying Ma, Wuyang Xue, R. Ying, Peilin Liu\",\"doi\":\"10.1109/IJCNN55064.2022.9892381\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Model-based reinforcement learning algorithms can alleviate the low sample efficiency problem compared with modelfree methods for control tasks. However, the learned policy's performance often lags behind the best model-free algorithms since its weak exploration ability. Existing model-based reinforcement learning algorithms learn policy by interacting with the learned world model and then use the learned policy to guide a new round of world model learning. Due to weak policy exploration ability, the learned world model has a large bias. As a result, it fails to learn the globally optimal policy on such a world model. This paper improves the learned world model by maximizing both the reward and the corresponding policy entropy in the framework of maximum entropy reinforcement learning. 
The effectiveness of applying the maximum entropy approach to model-based reinforcement learning is supported by the better performance of our algorithm on several complex mujoco and deepmind control suite tasks.\",\"PeriodicalId\":106974,\"journal\":{\"name\":\"2022 International Joint Conference on Neural Networks (IJCNN)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 International Joint Conference on Neural Networks (IJCNN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IJCNN55064.2022.9892381\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 International Joint Conference on Neural Networks (IJCNN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IJCNN55064.2022.9892381","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MaxEnt Dreamer: Maximum Entropy Reinforcement Learning with World Model
Model-based reinforcement learning algorithms can alleviate the low sample efficiency of model-free methods on control tasks. However, the learned policy's performance often lags behind the best model-free algorithms because of its weak exploration ability. Existing model-based reinforcement learning algorithms learn a policy by interacting with the learned world model and then use that policy to guide a new round of world model learning. Because the policy explores poorly, the learned world model is heavily biased, and the globally optimal policy cannot be learned on such a model. This paper improves the learned world model by maximizing both the reward and the corresponding policy entropy within the framework of maximum entropy reinforcement learning. The effectiveness of applying the maximum entropy approach to model-based reinforcement learning is supported by the improved performance of our algorithm on several complex MuJoCo and DeepMind Control Suite tasks.
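
The abstract does not spell out the training objective, but the maximum entropy framework it invokes is standard: the policy is trained to maximize expected return plus a policy-entropy bonus, with a temperature \alpha trading off reward against exploration:

    J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} \gamma^{t}\big(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_t))\big)\Big]

In a Dreamer-style model-based setup, this objective is evaluated on rollouts imagined inside the learned world model rather than on real transitions. A minimal sketch of such an entropy-regularized actor loss follows; all names and the fixed temperature are illustrative assumptions, not the authors' implementation:

    import torch

    def maxent_actor_loss(imagined_returns, action_dists, alpha=1e-3):
        # imagined_returns: tensor [horizon, batch] of lambda-returns computed
        #   on trajectories rolled out inside the learned world model.
        # action_dists: one torch.distributions object per imagined step, each
        #   yielding per-sample entropies of shape [batch].
        # alpha: entropy temperature (assumed fixed here; it could be learned).
        entropy = torch.stack([d.entropy() for d in action_dists])  # [horizon, batch]
        # Maximize return plus entropy bonus, i.e. minimize the negation.
        return -(imagined_returns + alpha * entropy).mean()

The entropy bonus keeps the policy stochastic during world-model interaction, which is exactly the mechanism the abstract credits for reducing world-model bias.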