Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning

IF 2.2 · Mathematics (Tier 2) · Q2 · AUTOMATION & CONTROL SYSTEMS
Lukasz Szpruch, Tanut Treetanthiploet, Yufei Zhang
{"title":"连续时间线性-二次强化学习的熵正则优化调度","authors":"Lukasz Szpruch, Tanut Treetanthiploet, Yufei Zhang","doi":"10.1137/22m1515744","DOIUrl":null,"url":null,"abstract":"SIAM Journal on Control and Optimization, Volume 62, Issue 1, Page 135-166, February 2024. <br/> Abstract. This work uses the entropy-regularized relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein, an agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning, but, on the other hand, they introduce bias by assigning a positive probability to nonoptimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularization. We study algorithms resulting from two entropy regularization formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalizes policy divergence between consecutive episodes. We focus on the finite horizon continuous-time linear-quadratic (LQ) RL problem, where a linear dynamics with unknown drift coefficients is controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularization, we prove that the regret, for both learning algorithms, is of the order [math] (up to a logarithmic factor) over [math] episodes, matching the best known result from the literature.","PeriodicalId":49531,"journal":{"name":"SIAM Journal on Control and Optimization","volume":null,"pages":null},"PeriodicalIF":2.2000,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Optimal Scheduling of Entropy Regularizer for Continuous-Time Linear-Quadratic Reinforcement Learning\",\"authors\":\"Lukasz Szpruch, Tanut Treetanthiploet, Yufei Zhang\",\"doi\":\"10.1137/22m1515744\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"SIAM Journal on Control and Optimization, Volume 62, Issue 1, Page 135-166, February 2024. <br/> Abstract. This work uses the entropy-regularized relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein, an agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning, but, on the other hand, they introduce bias by assigning a positive probability to nonoptimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularization. We study algorithms resulting from two entropy regularization formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalizes policy divergence between consecutive episodes. We focus on the finite horizon continuous-time linear-quadratic (LQ) RL problem, where a linear dynamics with unknown drift coefficients is controlled subject to quadratic costs. 
In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularization, we prove that the regret, for both learning algorithms, is of the order [math] (up to a logarithmic factor) over [math] episodes, matching the best known result from the literature.\",\"PeriodicalId\":49531,\"journal\":{\"name\":\"SIAM Journal on Control and Optimization\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2024-01-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SIAM Journal on Control and Optimization\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://doi.org/10.1137/22m1515744\",\"RegionNum\":2,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIAM Journal on Control and Optimization","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1137/22m1515744","RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

SIAM Journal on Control and Optimization, Volume 62, Issue 1, Page 135-166, February 2024.
Abstract. This work uses the entropy-regularized relaxed stochastic control perspective as a principled framework for designing reinforcement learning (RL) algorithms. Herein, an agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy. The noisy policies, on the one hand, explore the space and hence facilitate learning, but, on the other hand, they introduce bias by assigning a positive probability to nonoptimal actions. This exploration-exploitation trade-off is determined by the strength of entropy regularization. We study algorithms resulting from two entropy regularization formulations: the exploratory control approach, where entropy is added to the cost objective, and the proximal policy update approach, where entropy penalizes policy divergence between consecutive episodes. We focus on the finite horizon continuous-time linear-quadratic (LQ) RL problem, where a linear dynamics with unknown drift coefficients is controlled subject to quadratic costs. In this setting, both algorithms yield a Gaussian relaxed policy. We quantify the precise difference between the value functions of a Gaussian policy and its noisy evaluation and show that the execution noise must be independent across time. By tuning the frequency of sampling from relaxed policies and the parameter governing the strength of entropy regularization, we prove that the regret, for both learning algorithms, is of the order [math] (up to a logarithmic factor) over [math] episodes, matching the best known result from the literature.
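To make the episodic learning loop described in the abstract concrete, the following is a minimal, hypothetical sketch, not the paper's actual algorithm or its tuning: a scalar LQ system with unknown drift coefficients is controlled by sampling actions from a Gaussian policy whose mean is the certainty-equivalent LQ feedback and whose variance is set by an entropy-regularization strength that decays across episodes, with execution noise drawn independently at every time step; the drift is re-estimated by least squares after each episode. The 1/sqrt(n) schedule, the tau/(2R) variance scaling, and all names are illustrative assumptions.

```python
import numpy as np

T, dt = 1.0, 0.01                         # horizon and execution grid
A_true, B_true, sigma = 0.3, 1.0, 0.5     # unknown drift (A, B) and known noise level
Q, R = 1.0, 1.0                           # running cost Q*x^2 + R*a^2, zero terminal cost
N = 200                                   # number of learning episodes
rng = np.random.default_rng(0)

def riccati_gains(A, B):
    """Backward Euler on the scalar Riccati ODE; returns feedback gains k_i ~ -B*P(t_i)/R."""
    n_steps = int(T / dt)
    P = 0.0                               # terminal condition P(T) = 0
    gains = np.empty(n_steps)
    for i in reversed(range(n_steps)):
        gains[i] = -B * P / R
        P += dt * (2 * A * P - (B * P) ** 2 / R + Q)
    return gains

A_hat, B_hat = 0.0, 0.0                   # initial drift estimates
features, increments = [], []             # regression data for least-squares estimation
for n in range(1, N + 1):
    tau_n = 1.0 / np.sqrt(n)              # illustrative decay of the entropy strength
    gains = riccati_gains(A_hat, B_hat)   # certainty-equivalent LQ feedback
    x = 1.0
    for k in gains:
        # Gaussian relaxed policy: mean = LQ feedback, variance set by tau_n;
        # fresh execution noise at every grid point, i.i.d. across time.
        a = k * x + np.sqrt(tau_n / (2 * R)) * rng.standard_normal()
        dW = np.sqrt(dt) * rng.standard_normal()
        x_next = x + (A_true * x + B_true * a) * dt + sigma * dW
        features.append([x * dt, a * dt])
        increments.append(x_next - x)
        x = x_next
    # re-estimate the unknown drift from all observed state increments
    A_hat, B_hat = np.linalg.lstsq(np.array(features), np.array(increments), rcond=None)[0]

print(f"drift estimates after {N} episodes: A_hat = {A_hat:.3f}, B_hat = {B_hat:.3f}")
```

In this illustration the exploration variance shrinks with tau_n, so early episodes explore while later episodes exploit the improving drift estimates; the paper's regret analysis additionally schedules how often the relaxed policy is resampled, which this sketch does not attempt to reproduce.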
Source journal
CiteScore: 4.00
Self-citation rate: 4.50%
Annual articles: 143
Review time: 12 months
Journal description: SIAM Journal on Control and Optimization (SICON) publishes original research articles on the mathematics and applications of control theory and certain parts of optimization theory. Papers considered for publication must be significant at both the mathematical level and the level of applications or potential applications. Papers containing mostly routine mathematics or those with no discernible connection to control and systems theory or optimization will not be considered for publication. From time to time, the journal will also publish authoritative surveys of important subject areas in control theory and optimization whose level of maturity permits a clear and unified exposition. The broad areas mentioned above are intended to encompass a wide range of mathematical techniques and scientific, engineering, economic, and industrial applications. These include stochastic and deterministic methods in control, estimation, and identification of systems; modeling and realization of complex control systems; the numerical analysis and related computational methodology of control processes and allied issues; and the development of mathematical theories and techniques that give new insights into old problems or provide the basis for further progress in control theory and optimization. Within the field of optimization, the journal focuses on the parts that are relevant to dynamic and control systems. Contributions to numerical methodology are also welcome in accordance with these aims, especially as related to large-scale problems and decomposition as well as to fundamental questions of convergence and approximation.