Pessimistic policy iteration with bounded uncertainty
Zhiyong Peng, Changlin Han, Yadong Liu, Jingsheng Tang, Zongtan Zhou
Expert Systems with Applications, Volume 282, Article 127651. DOI: 10.1016/j.eswa.2025.127651
Abstract
Offline Reinforcement Learning (RL) aims to learn policies from static datasets. Extrapolation error on out-of-distribution (OOD) samples can cause off-policy RL algorithms to perform poorly on offline datasets, so it is critical in offline RL to avoid visiting OOD states and taking OOD actions. Several recent methods use uncertainty estimation to identify OOD samples. However, errors in the uncertainty estimates make purely uncertainty-based methods unstable and force them to rely on additional components to ensure sufficient pessimism. In this study, we propose a Bounded Uncertainty based Pessimistic policy iteration algorithm (BUP). BUP pessimistically estimates the value function via bounded uncertainty, and the uncertainty bound is enforced by constraining the actor from taking highly uncertain actions. The suboptimality bound of BUP is theoretically guaranteed in linear Markov Decision Processes (MDPs), and experiments on D4RL datasets show that BUP matches state-of-the-art performance. Moreover, BUP is simple to implement, has low computational cost, and does not require any additional components.
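The abstract describes the mechanism only at a high level. The fragment below is a minimal, illustrative sketch of a pessimistically penalized value target with a clipped (bounded) uncertainty term, assuming an ensemble of critics whose standard deviation serves as the uncertainty estimate; it is not the authors' implementation, and all names (pessimistic_target, uncertainty_bound, penalty_weight) are hypothetical.

```python
import torch

def pessimistic_target(q_ensemble: torch.Tensor,
                       uncertainty_bound: float,
                       penalty_weight: float = 1.0) -> torch.Tensor:
    """Illustrative pessimistic value target with a bounded uncertainty penalty.

    q_ensemble: tensor of shape (num_critics, batch) holding ensemble
    Q-estimates for the next state-action pairs.
    """
    mean_q = q_ensemble.mean(dim=0)
    # Ensemble standard deviation as a simple proxy for epistemic uncertainty.
    uncertainty = q_ensemble.std(dim=0)
    # Clamp the uncertainty so the pessimism penalty stays bounded.
    bounded_uncertainty = uncertainty.clamp(max=uncertainty_bound)
    return mean_q - penalty_weight * bounded_uncertainty

if __name__ == "__main__":
    q = torch.randn(5, 256)  # 5 critics, batch of 256 transitions
    target = pessimistic_target(q, uncertainty_bound=2.0, penalty_weight=0.5)
    print(target.shape)      # torch.Size([256])
```

In the paper's formulation the bound on the penalty is obtained indirectly, by constraining the actor away from highly uncertain actions; the explicit clamp above merely illustrates the effect of keeping the pessimism term bounded.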
Journal introduction:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.