Corruption-Robust Exploration in Episodic Reinforcement Learning

Impact factor 1.9; Zone 3 (Mathematics); JCR Q2 (Mathematics, Applied)

Mathematics of Operations Research Pub Date : 2024-05-23 DOI:10.1287/moor.2021.0202

Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, Wen Sun


Citations: 0

Abstract

We initiate the study of episodic reinforcement learning (RL) under adversarial corruptions in both the rewards and the transition probabilities of the underlying system, extending recent results for the special case of multiarmed bandits. We provide a framework that modifies the aggressive exploration enjoyed by existing reinforcement learning approaches based on optimism in the face of uncertainty by complementing them with principles from action elimination. Importantly, our framework circumvents the major challenges posed by naively applying action elimination in the RL setting, as formalized by a lower bound we demonstrate. Our framework yields efficient algorithms that (a) attain near-optimal regret in the absence of corruptions and (b) adapt to unknown levels of corruption, enjoying regret guarantees that degrade gracefully in the total corruption encountered. To showcase the generality of our approach, we derive results for both tabular settings (where states and actions are finite) and linear Markov decision process settings (where the dynamics and rewards admit a linear underlying representation). Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely independent and identically distributed transitions in the bandit-feedback model for episodic reinforcement learning.

Supplemental Material: The online appendix is available at https://doi.org/10.1287/moor.2021.0202.
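The action-elimination principle mentioned in the abstract can be illustrated in the multiarmed-bandit special case. The sketch below is not the paper's algorithm: it is a textbook-style active-arm elimination loop, and every name in it (`confidence_radius`, `active_arm_elimination`, `corruption_budget`) is hypothetical. Widening the confidence radius by `corruption_budget / n_pulls` is an assumption made for illustration and requires knowing the budget in advance, whereas the paper's framework adapts to unknown corruption levels.

```python
import math
import random


def confidence_radius(n_pulls, horizon, corruption_budget=0.0):
    """Hoeffding-style radius, widened by an assumed known corruption budget."""
    return math.sqrt(2.0 * math.log(horizon) / n_pulls) + corruption_budget / n_pulls


def active_arm_elimination(means, horizon, corruption_budget=0.0, seed=0):
    """Pull all active arms round-robin; drop arms whose upper confidence
    bound falls below the best lower confidence bound.

    `means` are Bernoulli reward means of a simulated, uncorrupted bandit.
    Returns the surviving arm set and the per-arm pull counts.
    """
    n_arms = len(means)
    if horizon < n_arms:
        raise ValueError("horizon must allow at least one pull per arm")
    rng = random.Random(seed)
    active = set(range(n_arms))
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    t = 0
    while t < horizon:
        # One round: pull every arm that is still active.
        for arm in sorted(active):
            if t >= horizon:
                break
            reward = 1.0 if rng.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            t += 1
        # Confidence bounds for the active arms (each has at least one pull).
        lcb, ucb = {}, {}
        for arm in active:
            mean_hat = sums[arm] / counts[arm]
            radius = confidence_radius(counts[arm], horizon, corruption_budget)
            lcb[arm] = mean_hat - radius
            ucb[arm] = mean_hat + radius
        best_lcb = max(lcb.values())
        # Keep only arms that could still be optimal.
        active = {arm for arm in active if ucb[arm] >= best_lcb}
    return active, counts
```

A call such as `active_arm_elimination([0.9, 0.5, 0.1], horizon=5000)` returns the set of arms still considered plausible at the end of the run. A larger `corruption_budget` widens every interval, so arms are eliminated later or not at all, which is the basic trade-off between robustness and exploration cost.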




Source journal

Mathematics of Operations Research (Management Science / Applied Mathematics)

CiteScore: 3.40

Self-citation rate: 5.90%

Articles published: 178

Review time: 15.0 months

Journal description: Mathematics of Operations Research is an international journal of the Institute for Operations Research and the Management Sciences (INFORMS). The journal invites articles concerned with the mathematical and computational foundations in the areas of continuous, discrete, and stochastic optimization; mathematical programming; dynamic programming; stochastic processes; stochastic models; simulation methodology; control and adaptation; networks; game theory; and decision theory. Also sought are contributions to learning theory and machine learning that have special relevance to decision making, operations research, and management science. The emphasis is on originality, quality, and importance; correctness alone is not sufficient. Significant developments in operations research and management science not having substantial mathematical interest should be directed to other journals such as Management Science or Operations Research.