Offline Reinforcement Learning with On-Policy Q-Function Regularization

Machine learning and knowledge discovery in databases : European Conference, ECML PKDD ... : proceedings. ECML PKDD (Conference) Pub Date : 2023-07-25 DOI:10.48550/arXiv.2307.13824

Laixi Shi, Robert Dadashi, Yuejie Chi, P. S. Castro, M. Geist


Abstract

The core challenge of offline reinforcement learning (RL) is dealing with the potentially catastrophic extrapolation error induced by the distribution shift between the historical dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly or explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate and handles the extrapolation error more straightforwardly. We propose two algorithms that take advantage of the estimated Q-function through regularization, and demonstrate that they exhibit strong performance on the D4RL benchmarks.
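To make the "SARSA-style estimate" concrete, here is a minimal tabular sketch of evaluating the behavior policy's Q-function from logged transitions. It is an illustration under stated assumptions, not the authors' implementation: the function name `sarsa_q_estimate`, the tuple layout, the hyperparameters, and the toy dataset are all hypothetical, and a practical version would presumably use function approximation rather than a table.

```python
from collections import defaultdict


def sarsa_q_estimate(transitions, gamma=0.99, lr=0.5, epochs=200):
    """Tabular SARSA-style evaluation of the behavior policy's Q-function.

    Each transition is (s, a, r, s_next, a_next, done), where a_next is the
    action the behavior policy actually took in s_next (read from the dataset).
    This layout is an assumption made for the sketch.
    """
    q = defaultdict(float)  # Q-values default to 0.0 for unseen (s, a) pairs
    for _ in range(epochs):
        for s, a, r, s_next, a_next, done in transitions:
            # Bootstrap only from the logged next action, never from a max
            # over actions, so no out-of-dataset action is ever queried.
            target = r if done else r + gamma * q[(s_next, a_next)]
            q[(s, a)] += lr * (target - q[(s, a)])
    return dict(q)


# Hypothetical two-step episode: state 0 -> state 1 -> terminal,
# with reward 1 on the final step.
toy_dataset = [
    (0, 0, 0.0, 1, 0, False),
    (1, 0, 1.0, None, None, True),
]
q_beta = sarsa_q_estimate(toy_dataset, gamma=0.9)
```

On this toy data the estimate approaches Q(1, 0) = 1 and Q(0, 0) = 0.9. Because every bootstrap target uses an action that appears in the dataset, the estimate never evaluates the Q-function at unseen actions, which is the property the abstract relies on when it says the Q-function handles extrapolation error more straightforwardly.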


