{"title":"Doubly Pessimistic Algorithms for Strictly Safe Off-Policy Optimization","authors":"Sanae Amani, Lin F. Yang","doi":"10.1109/CISS53076.2022.9751158","DOIUrl":null,"url":null,"abstract":"We study offline reinforcement learning (RL) in the presence of safety requirements: from a dataset collected a priori and without direct access to the true environment, learn an optimal policy that is guaranteed to respect the safety constraints. We address this problem by modeling the safety requirement as an unknown cost function of states and actions, whose expected value with respect to the policy must fall below a certain threshold. We then present an algorithm in the context of finite-horizon Markov decision processes (MDPs), termed Safe-DPVI that performs in a doubly pessimistic manner when 1) it constructs a conservative set of safe policies; and 2) when it selects a good policy from that conservative set. Without assuming the sufficient coverage of the dataset or any structure for the underlying MDPs, we establish a data-dependent upper bound on the suboptimality gap of the safe policy Safe-DPVI returns. We then specialize our results to linear MDPs with appropriate assumptions on dataset being well-explored. Both data-dependent and specialized bounds nearly match that of state-of-the-art unsafe offline RL algorithms, with an additional multiplicative factor $\\frac{\\Sigma_{h=1}^{H}\\alpha_{h}}{H}$, where αh characterizes the safety constraint at time-step $h$. We further present numerical simulations that corroborate our theoretical findings. A full version referred to as technical report of this paper is accessible at: https://offline-rl-neurips.github.io/2021/pdf/21.pdf","PeriodicalId":305918,"journal":{"name":"2022 56th Annual Conference on Information Sciences and Systems (CISS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 56th Annual Conference on Information Sciences and Systems (CISS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISS53076.2022.9751158","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
We study offline reinforcement learning (RL) in the presence of safety requirements: from a dataset collected a priori and without direct access to the true environment, learn an optimal policy that is guaranteed to respect the safety constraints. We address this problem by modeling the safety requirement as an unknown cost function of states and actions, whose expected value with respect to the policy must fall below a certain threshold. We then present an algorithm for finite-horizon Markov decision processes (MDPs), termed Safe-DPVI, that acts in a doubly pessimistic manner: 1) when it constructs a conservative set of safe policies; and 2) when it selects a good policy from that conservative set. Without assuming sufficient coverage of the dataset or any structure of the underlying MDPs, we establish a data-dependent upper bound on the suboptimality gap of the safe policy that Safe-DPVI returns. We then specialize our results to linear MDPs under appropriate assumptions that the dataset is well-explored. Both the data-dependent and the specialized bounds nearly match those of state-of-the-art unsafe offline RL algorithms, up to an additional multiplicative factor $\frac{\sum_{h=1}^{H}\alpha_{h}}{H}$, where $\alpha_h$ characterizes the safety constraint at time-step $h$. We further present numerical simulations that corroborate our theoretical findings. A full version of this paper, referred to as the technical report, is accessible at: https://offline-rl-neurips.github.io/2021/pdf/21.pdf
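To illustrate the double-pessimism idea described above, the following is a minimal conceptual sketch in a toy setting with a finite set of candidate policies. All quantities here (the candidate policies, the estimate/width arrays, and the threshold tau) are illustrative assumptions, not the paper's actual Safe-DPVI construction: the point is only that safety is certified with a pessimistic (inflated) cost estimate, and the policy is then chosen by a pessimistic (deflated) value estimate.

```python
import numpy as np

# Toy sketch of a doubly pessimistic selection rule (assumed setup, not Safe-DPVI itself).
rng = np.random.default_rng(0)

H = 5                    # horizon
tau = 1.0                # safety threshold on expected cumulative cost
num_candidates = 20      # candidate policies evaluated from the offline dataset

# Offline estimates for each candidate policy: point estimates of value and cost,
# plus uncertainty widths (e.g., from concentration bounds on the dataset).
value_est = rng.uniform(0.0, H, size=num_candidates)
cost_est = rng.uniform(0.0, 2.0, size=num_candidates)
value_width = rng.uniform(0.0, 0.5, size=num_candidates)
cost_width = rng.uniform(0.0, 0.5, size=num_candidates)

# Pessimism 1: keep only policies whose pessimistic (upper-bounded) cost still
# satisfies the constraint, yielding a conservative set of safe policies.
safe_set = np.flatnonzero(cost_est + cost_width <= tau)
if safe_set.size == 0:
    raise RuntimeError("No policy is certifiably safe from the offline data.")

# Pessimism 2: within that conservative set, pick the policy with the best
# pessimistic (lower-bounded) value estimate.
best = safe_set[np.argmax(value_est[safe_set] - value_width[safe_set])]

print(f"selected policy index: {best}")
print(f"pessimistic value lower bound: {value_est[best] - value_width[best]:.3f}")
print(f"pessimistic cost upper bound:  {cost_est[best] + cost_width[best]:.3f}")
```

Being conservative twice is what lets the returned policy satisfy the constraint without assuming the dataset covers the whole state-action space, at the price of the extra multiplicative factor in the suboptimality bound noted in the abstract.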