The Unintended Consequences of Discount Regularization: Improving Regularization in Certainty Equivalence Reinforcement Learning

Proceedings of Machine Learning Research. Pub Date: 2023-06-20. DOI: 10.48550/arXiv.2306.11208

Sarah Rathnam, S. Parbhoo, Weiwei Pan, Susan A. Murphy, F. Doshi-Velez

Volume 202, pages 28746-28767.

Citation count: 1

Abstract

Discount regularization, using a shorter planning horizon when calculating the optimal policy, is a popular choice to restrict planning to a less complex set of policies when estimating an MDP from sparse or noisy data (Jiang et al., 2015). It is commonly understood that discount regularization functions by de-emphasizing or ignoring delayed effects. In this paper, we reveal an alternate view of discount regularization that exposes unintended consequences. We demonstrate that planning under a lower discount factor produces an identical optimal policy to planning using any prior on the transition matrix that has the same distribution for all states and actions. In fact, it functions like a prior with stronger regularization on state-action pairs with more transition data. This leads to poor performance when the transition matrix is estimated from data sets with uneven amounts of data across state-action pairs. Our equivalence theorem leads to an explicit formula to set regularization parameters locally for individual state-action pairs rather than globally. We demonstrate the failures of discount regularization and how we remedy them using our state-action-specific method across simple empirical examples as well as a medical cancer simulator.
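The equivalence stated in the abstract can be checked numerically on a small random MDP. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the prior enters as a convex mixture with a hypothetical weight `eps`, so that the regularized transition matrix is `(1 - eps) * T_hat + eps * mu`, where `mu` is one fixed next-state distribution shared by every state-action pair, and it assumes the matching lower discount factor is `(1 - eps) * gamma`. The names `T_hat`, `mu`, `eps`, and `value_iteration` are all introduced here for illustration.

```python
import numpy as np

def value_iteration(T, R, gamma, n_iter=2000):
    """Plain value iteration for transitions T[s, a, s'] and rewards R[s, a].

    Returns the greedy policy and the Q-values after n_iter sweeps.
    """
    Q = np.zeros_like(R)
    for _ in range(n_iter):
        V = Q.max(axis=1)            # state values under the greedy policy
        Q = R + gamma * (T @ V)      # Bellman optimality backup
    return Q.argmax(axis=1), Q

rng = np.random.default_rng(0)
n_states, n_actions = 6, 3

# Hypothetical estimated MDP: random transition rows and random rewards.
T_hat = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

# Assumed prior: the same next-state distribution mu for every (s, a).
mu = rng.dirichlet(np.ones(n_states))

gamma = 0.95   # original discount factor
eps = 0.3      # assumed mixing weight on the prior

# Option 1: keep T_hat, plan with the lower discount factor.
policy_discount, Q_discount = value_iteration(T_hat, R, (1 - eps) * gamma)

# Option 2: keep gamma, plan on the prior-mixed transition matrix.
T_prior = (1 - eps) * T_hat + eps * mu   # mu broadcasts over the last axis
policy_prior, Q_prior = value_iteration(T_prior, R, gamma)

print("same optimal policy:", np.array_equal(policy_discount, policy_prior))
```

Under these assumptions the two Q-functions differ only by a constant, because the prior term contributes the same amount to every state-action pair, so the greedy action in each state is unchanged. The sketch uses a single global `eps`; it does not attempt the state-action-specific formula that the abstract mentions.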
