DOMAIN: Mildly Conservative Model-Based Offline Reinforcement Learning

IF 8.7 | Zone 1, Computer Science | Q1 AUTOMATION & CONTROL SYSTEMS

IEEE Transactions on Systems Man Cybernetics-Systems Pub Date : 2025-07-08 DOI:10.1109/TSMC.2025.3578666

Xiao-Yin Liu;Xiao-Hu Zhou;Mei-Jiang Gui;Guo-Tao Li;Xiao-Liang Xie;Shi-Qi Liu;Shuang-Yi Wang;Qi-Chao Zhang;Biao Luo;Zeng-Guang Hou

Volume 55, Issue 10, Pages 7142-7155. https://ieeexplore.ieee.org/document/11072806/

Citations: 0

Abstract

Model-based reinforcement learning (RL), which learns an environment model from an offline dataset and uses it to generate additional out-of-distribution model data, has become an effective approach to the distribution shift problem in offline RL. Because the learned model differs from the actual environment, conservatism must be incorporated into the algorithm to balance accurate offline data against imprecise model data. The conservatism of current algorithms relies mostly on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and previous methods ignore differences among model data samples, which makes them overly conservative. To address these issues, this article proposes a mildly conservative model-based offline RL algorithm (DOMAIN) that does not estimate model uncertainty, and designs an adaptive sampling distribution over model samples that adaptively adjusts the penalty on model data. We show theoretically that the Q value learned by DOMAIN outside the region is a lower bound of the true Q value, that DOMAIN is less conservative than previous model-based offline RL algorithms, and that it guarantees safe policy improvement. Extensive experiments show that DOMAIN outperforms prior RL algorithms, improving average performance on the D4RL benchmark by 1.8%.
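The abstract does not give the loss function, so the following is only a minimal sketch of the general idea it describes: a conservative regularizer that pushes Q values down on model-generated samples, with a sampling distribution that weights those samples unequally. Everything here is an assumption for illustration and not the authors' formulation: the function names, the softmax form of the weights, the per-sample `sample_scores` input, the `temperature` and `lam` parameters, and the CQL-style "model Q minus data Q" regularizer.

```python
import numpy as np


def adaptive_weights(sample_scores, temperature=1.0):
    """Hypothetical sampling distribution over model samples.

    Assumption: each model sample carries a scalar score, and a higher
    score means the sample should be penalized more. A softmax turns the
    scores into a probability distribution.
    """
    z = np.asarray(sample_scores, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def conservative_regularizer(q_model, q_data, weights):
    """Weighted mean Q on model samples minus mean Q on dataset samples.

    Minimizing this term lowers Q on (heavily weighted) model data and
    raises Q on offline data, which is the usual conservative pattern.
    """
    q_model = np.asarray(q_model, dtype=float)
    q_data = np.asarray(q_data, dtype=float)
    return float(np.dot(weights, q_model) - q_data.mean())


def total_loss(q_model, q_data, td_targets, weights, lam=1.0):
    """Bellman error on dataset samples plus the weighted regularizer."""
    q_data = np.asarray(q_data, dtype=float)
    td_targets = np.asarray(td_targets, dtype=float)
    bellman = 0.5 * float(np.mean((q_data - td_targets) ** 2))
    return bellman + lam * conservative_regularizer(q_model, q_data, weights)


if __name__ == "__main__":
    weights = adaptive_weights([0.1, 0.5, 2.0])
    loss = total_loss(
        q_model=[1.0, 1.5, 4.0],
        q_data=[2.0, 2.5],
        td_targets=[2.1, 2.4],
        weights=weights,
        lam=0.5,
    )
    print(weights, loss)
```

With equal scores the weights are uniform and the regularizer reduces to a plain difference of means; unequal scores concentrate the penalty on a subset of model samples, which is one way a method could be less conservative overall than penalizing all model data equally.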


Source journal

IEEE Transactions on Systems Man Cybernetics-Systems AUTOMATION & CONTROL SYSTEMS-COMPUTER SCIENCE, CYBERNETICS

CiteScore: 18.50

Self-citation rate: 11.50%

Articles published: 812

Review time: 6 months

Journal description: The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.