DOMAIN: Mildly Conservative Model-Based Offline Reinforcement Learning

IF 8.7 · CAS Division 1 (Computer Science) · Q1 AUTOMATION & CONTROL SYSTEMS
Xiao-Yin Liu;Xiao-Hu Zhou;Mei-Jiang Gui;Guo-Tao Li;Xiao-Liang Xie;Shi-Qi Liu;Shuang-Yi Wang;Qi-Chao Zhang;Biao Luo;Zeng-Guang Hou
{"title":"领域:轻度保守的基于模型的离线强化学习","authors":"Xiao-Yin Liu;Xiao-Hu Zhou;Mei-Jiang Gui;Guo-Tao Li;Xiao-Liang Xie;Shi-Qi Liu;Shuang-Yi Wang;Qi-Chao Zhang;Biao Luo;Zeng-Guang Hou","doi":"10.1109/TSMC.2025.3578666","DOIUrl":null,"url":null,"abstract":"Model-based reinforcement learning (RL), which learns an environment model from the offline dataset and generates more out-of-distribution model data, has become an effective approach to the problem of distribution shift in offline RL. Due to the gap between the learned and actual environment, conservatism should be incorporated into the algorithm to balance accurate offline data and imprecise model data. The conservatism of current algorithms mostly relies on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and the previous methods ignore differences between the model data, which brings great conservatism. To address the above issues, this article proposes a mildly conservative model-based offline RL algorithm (DOMAIN) without estimating model uncertainty, and designs the adaptive sampling distribution of model samples, which can adaptively adjust the model data penalty. In this article, we theoretically demonstrate that the Q value learned by the DOMAIN outside the region is a lower bound of the true Q value, the DOMAIN is less conservative than previous model-based offline RL algorithms, and has the guarantee of safety policy improvement. The results of extensive experiments show that DOMAIN outperforms prior RL algorithms and the average performance has improved by 1.8% on the D4RL benchmark.","PeriodicalId":48915,"journal":{"name":"IEEE Transactions on Systems Man Cybernetics-Systems","volume":"55 10","pages":"7142-7155"},"PeriodicalIF":8.7000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DOMAIN: Mildly Conservative Model-Based Offline Reinforcement Learning\",\"authors\":\"Xiao-Yin Liu;Xiao-Hu Zhou;Mei-Jiang Gui;Guo-Tao Li;Xiao-Liang Xie;Shi-Qi Liu;Shuang-Yi Wang;Qi-Chao Zhang;Biao Luo;Zeng-Guang Hou\",\"doi\":\"10.1109/TSMC.2025.3578666\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Model-based reinforcement learning (RL), which learns an environment model from the offline dataset and generates more out-of-distribution model data, has become an effective approach to the problem of distribution shift in offline RL. Due to the gap between the learned and actual environment, conservatism should be incorporated into the algorithm to balance accurate offline data and imprecise model data. The conservatism of current algorithms mostly relies on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and the previous methods ignore differences between the model data, which brings great conservatism. To address the above issues, this article proposes a mildly conservative model-based offline RL algorithm (DOMAIN) without estimating model uncertainty, and designs the adaptive sampling distribution of model samples, which can adaptively adjust the model data penalty. In this article, we theoretically demonstrate that the Q value learned by the DOMAIN outside the region is a lower bound of the true Q value, the DOMAIN is less conservative than previous model-based offline RL algorithms, and has the guarantee of safety policy improvement. 
The results of extensive experiments show that DOMAIN outperforms prior RL algorithms and the average performance has improved by 1.8% on the D4RL benchmark.\",\"PeriodicalId\":48915,\"journal\":{\"name\":\"IEEE Transactions on Systems Man Cybernetics-Systems\",\"volume\":\"55 10\",\"pages\":\"7142-7155\"},\"PeriodicalIF\":8.7000,\"publicationDate\":\"2025-07-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Systems Man Cybernetics-Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11072806/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Systems Man Cybernetics-Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11072806/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Model-based reinforcement learning (RL), which learns an environment model from the offline dataset and uses it to generate additional out-of-distribution model data, has become an effective approach to the distribution-shift problem in offline RL. Because of the gap between the learned model and the actual environment, conservatism must be incorporated into the algorithm to balance accurate offline data against imprecise model data. The conservatism of current algorithms mostly relies on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and previous methods ignore differences among model data, which introduces excessive conservatism. To address these issues, this article proposes DOMAIN, a mildly conservative model-based offline RL algorithm that requires no model uncertainty estimation, and designs an adaptive sampling distribution over model samples that adaptively adjusts the penalty applied to model data. We theoretically show that the Q value learned by DOMAIN outside the data-support region is a lower bound of the true Q value, that DOMAIN is less conservative than previous model-based offline RL algorithms, and that it carries a safe policy improvement guarantee. Extensive experiments show that DOMAIN outperforms prior RL algorithms, improving average performance by 1.8% on the D4RL benchmark.
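As a rough illustration of the idea described in the abstract (penalizing model-generated transitions more heavily than offline transitions, with the penalty distributed adaptively across model samples), the sketch below shows one possible form of such a penalized TD target. All names (adaptive_weights, penalised_td_targets), the softmax weighting rule, and the hyperparameters alpha and beta are illustrative assumptions, not the DOMAIN paper's actual formulation.

```python
# Minimal NumPy sketch: mix accurate offline transitions with imprecise
# model-generated transitions, and penalise the latter adaptively.
# This is NOT the DOMAIN algorithm; it only conveys the general idea of an
# adaptive per-sample penalty on model data.

import numpy as np


def adaptive_weights(q_model: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Softmax weights over model samples: samples with larger Q estimates
    (more prone to over-estimation) receive larger penalty weights."""
    z = beta * (q_model - q_model.max())   # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()


def penalised_td_targets(r_real, q_next_real,
                         r_model, q_next_model, q_model,
                         gamma: float = 0.99, alpha: float = 1.0):
    """Standard TD targets for real data; penalised targets for model data.

    Each model sample i gets penalty alpha * N * w_i, so the average penalty
    over the model batch equals alpha, but it concentrates on samples whose
    current Q estimate is large.
    """
    target_real = r_real + gamma * q_next_real

    w = adaptive_weights(q_model)           # adaptive sampling distribution
    penalty = alpha * len(q_model) * w      # larger weight -> larger penalty
    target_model = r_model + gamma * q_next_model - penalty
    return target_real, target_model


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5
    t_real, t_model = penalised_td_targets(
        r_real=rng.normal(size=n), q_next_real=rng.normal(size=n),
        r_model=rng.normal(size=n), q_next_model=rng.normal(size=n),
        q_model=rng.normal(size=n),
    )
    print("real-data targets :", np.round(t_real, 3))
    print("model-data targets:", np.round(t_model, 3))
```

In the actual algorithm, how rollouts are generated, how the sampling distribution is defined, and how the penalty enters the Bellman update all differ; the snippet only captures the balance between trusted offline data and penalized model data that the abstract describes.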
Source Journal
IEEE Transactions on Systems Man Cybernetics-Systems
Categories: AUTOMATION & CONTROL SYSTEMS; COMPUTER SCIENCE, CYBERNETICS
CiteScore: 18.50
Self-citation rate: 11.50%
Articles per year: 812
Review time: 6 months
Journal description: The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.