Regret Analysis of Certainty Equivalence Policies in Continuous-Time Linear-Quadratic Systems

International Conference on System Theory, Control and Computing. Pub date: 2022-06-09. DOI: 10.48550/arXiv.2206.04434

Mohamad Kazem Shirani Faradonbeh


Citations: 0

Abstract


This work theoretically studies a ubiquitous reinforcement learning policy for controlling the canonical model of continuous-time stochastic linear-quadratic systems. We show that the randomized certainty equivalence policy addresses the exploration-exploitation dilemma in linear control systems that evolve according to unknown stochastic differential equations and whose operating cost is quadratic. More precisely, we establish regret bounds that grow as the square root of time, indicating that the randomized certainty equivalence policy learns optimal control actions quickly from a single state trajectory. Further, we show that the regret scales linearly with the number of parameters. The presented analysis introduces novel and useful technical approaches and sheds light on fundamental challenges of continuous-time reinforcement learning.
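The certainty equivalence idea described in the abstract can be illustrated with a small simulation. What follows is a hypothetical sketch, not the paper's algorithm. It simulates dX = (AX + BU)dt + dW with an Euler-Maruyama scheme along a single trajectory, re-estimates [A, B] by least squares at the end of each episode, and then applies the gain K = -R^{-1} B^T P from the continuous-time algebraic Riccati equation as if the estimates were exact. Several choices here are assumptions rather than anything taken from the source: the doubling episode schedule, the ridge term, the helper names `lq_gain` and `run_ce`, and modeling the randomization as Gaussian input dither whose variance shrinks like 1/sqrt(episode length). The sketch relies on NumPy and `scipy.linalg.solve_continuous_are`.

```python
import numpy as np
from scipy.linalg import solve_continuous_are


def lq_gain(A, B, Q, R):
    """Optimal feedback u = K x and Riccati solution P for dx = (Ax + Bu)dt + dW."""
    P = solve_continuous_are(A, B, Q, R)
    K = -np.linalg.solve(R, B.T @ P)
    return K, P


def run_ce(A, B, Q, R, episode_lengths, dt=0.01, dither_scale=2.0,
           ridge=1e-3, seed=0):
    """Episodic certainty equivalence along one Euler-Maruyama trajectory.

    Hypothetical sketch: the true (A, B) are used only to simulate the system.
    At the end of every episode the learner re-estimates [A, B] by ridge least
    squares from all data so far and switches to the gain that would be optimal
    if the estimates were exact. Randomization is modeled (as an assumption)
    by Gaussian input dither whose variance decays like 1/sqrt(episode length).
    """
    rng = np.random.default_rng(seed)
    n, m = B.shape
    x = np.zeros(n)
    K = np.zeros((m, n))            # assumes the open-loop A is already stable
    gram = ridge * np.eye(n + m)    # approximates  int z z^T dt,  z = [x; u]
    cross = np.zeros((n, n + m))    # approximates  int dx z^T
    A_hat, B_hat, cost = None, None, 0.0
    for length in episode_lengths:
        steps = int(round(length / dt))
        sigma = dither_scale * length ** -0.25
        w = np.sqrt(dt) * rng.standard_normal((steps, n))   # Brownian increments
        e = sigma * rng.standard_normal((steps, m))         # input dither
        for k in range(steps):
            u = K @ x + e[k]
            dx = (A @ x + B @ u) * dt + w[k]
            z = np.concatenate([x, u])
            gram += np.outer(z, z) * dt
            cross += np.outer(dx, z)
            cost += (x @ Q @ x + u @ R @ u) * dt
            x = x + dx
        # Least squares: [A_hat, B_hat] = cross @ inv(gram); gram is symmetric.
        theta = np.linalg.solve(gram, cross.T).T
        A_hat, B_hat = theta[:, :n], theta[:, n:]
        try:
            K, _ = lq_gain(A_hat, B_hat, Q, R)   # certainty-equivalent gain
        except (np.linalg.LinAlgError, ValueError):
            pass                                 # keep the previous gain
    return A_hat, B_hat, K, cost


if __name__ == "__main__":
    A = np.array([[0.0, 1.0], [-1.0, -1.0]])
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.eye(1)
    lengths = [50 * 2**i for i in range(5)]       # 50, 100, ..., 800
    A_hat, B_hat, K_hat, cost = run_ce(A, B, Q, R, lengths)
    K_star, P_star = lq_gain(A, B, Q, R)
    T = sum(lengths)
    # Unit diffusion: the optimal long-run average cost is tr(P*).
    print("K_hat =", K_hat, " K* =", K_star)
    print("single-run regret estimate:", cost - T * np.trace(P_star))
```

With unit diffusion, the optimal long-run average cost is tr(P*), so the accumulated cost minus T·tr(P*) gives a regret estimate from a single run. That estimate is noisy, and it also includes the cost of the dither. Seeing how regret actually grows with T would require averaging over many seeds and horizons. The sketch makes no claim that it reproduces the square-root-of-time rate proved in the paper.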
