Policy Regularization with Dataset Constraint for Offline Reinforcement Learning

Yuhang Ran, Yichen Li, Fuxiang Zhang, Zongzhang Zhang, Yang Yu
{"title":"Policy Regularization with Dataset Constraint for Offline Reinforcement Learning","authors":"Yuhang Ran, Yichen Li, Fuxiang Zhang, Zongzhang Zhang, Yang Yu","doi":"10.48550/arXiv.2306.06569","DOIUrl":null,"url":null,"abstract":"We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL). A common taxonomy of existing offline RL works is policy regularization, which typically constrains the learned policy by distribution or support of the behavior policy. However, distribution and support constraints are overly conservative since they both force the policy to choose similar actions as the behavior policy when considering particular states. It will limit the learned policy's performance, especially when the behavior policy is sub-optimal. In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective and thus propose Policy Regularization with Dataset Constraint (PRDC). When updating the policy in a given state, PRDC searches the entire dataset for the nearest state-action sample and then restricts the policy with the action of this sample. Unlike previous works, PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that do not appear in the dataset along with the given state. It is a softer constraint but still keeps enough conservatism from out-of-distribution actions. Empirical evidence and theoretical analysis show that PRDC can alleviate offline RL's fundamentally challenging value overestimation issue with a bounded performance gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves state-of-the-art performance compared with existing methods. Code is available at https://github.com/LAMDA-RL/PRDC","PeriodicalId":74529,"journal":{"name":"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning","volume":"28 1","pages":"28701-28717"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... International Conference on Machine Learning. International Conference on Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2306.06569","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL). A common class of existing offline RL methods is policy regularization, which typically constrains the learned policy by the distribution or support of the behavior policy. However, distribution and support constraints are overly conservative, since at a given state they both force the policy to choose actions similar to those of the behavior policy. This limits the learned policy's performance, especially when the behavior policy is sub-optimal. In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective, and we thus propose Policy Regularization with Dataset Constraint (PRDC). When updating the policy at a given state, PRDC searches the entire dataset for the nearest state-action sample and then restricts the policy with the action of that sample. Unlike previous works, PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that do not appear in the dataset together with the given state. It is a softer constraint, yet it still keeps enough conservatism against out-of-distribution actions. Empirical evidence and theoretical analysis show that PRDC can alleviate value overestimation, a fundamental challenge in offline RL, with a bounded performance gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves state-of-the-art performance compared with existing methods. Code is available at https://github.com/LAMDA-RL/PRDC
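To make the regularizer concrete, below is a minimal NumPy sketch of the nearest-neighbor lookup the abstract describes: given a state and the policy's proposed action, retrieve the closest (state, action) pair in the offline dataset and use its action as the regularization target. The dataset arrays, the state-weighting coefficient `beta`, the brute-force distance search, and the squared-error penalty are illustrative assumptions, not the paper's exact implementation (which the abstract does not specify); the linked repository may, for instance, use a tree-based search instead.

```python
import numpy as np

# Hypothetical offline dataset (shapes chosen to resemble a locomotion task).
rng = np.random.default_rng(0)
dataset_states = rng.standard_normal((10_000, 17))   # observations from the behavior policy
dataset_actions = rng.standard_normal((10_000, 6))   # corresponding actions
beta = 2.0                                            # assumed state/action trade-off weight

# Represent each dataset entry as a search key [beta * s, a].
keys = np.concatenate([beta * dataset_states, dataset_actions], axis=1)

def nearest_dataset_action(state, policy_action):
    """Return the action of the dataset sample nearest to (beta*state, policy_action)."""
    query = np.concatenate([beta * state, policy_action])
    idx = int(np.argmin(np.linalg.norm(keys - query, axis=1)))
    return dataset_actions[idx]

# Usage sketch: the actor loss would add a penalty pulling the policy's action
# toward the retrieved dataset action, on top of the usual Q-maximization term.
s = rng.standard_normal(17)
a_pi = rng.standard_normal(6)                  # action proposed by the current policy
a_ref = nearest_dataset_action(s, a_pi)
penalty = float(np.sum((a_pi - a_ref) ** 2))   # regularization term (weight omitted)
```

Because the retrieved action comes from a possibly different state, the policy is not forced to imitate the behavior policy at the query state itself, which is what makes this constraint softer than distribution or support matching while still anchoring the policy to actions that exist in the dataset.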