Logarithmic Regret for Episodic Continuous-Time Linear-Quadratic Reinforcement Learning Over a Finite-Time Horizon

Computation Theory eJournal. Pub Date: 2020-06-27. DOI: 10.2139/ssrn.3848428

Matteo Basei, Xin Guo, Anran Hu, Yufei Zhang

Citations: 23

Abstract

We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of order $O((\ln M)(\ln\ln M))$, with $M$ being the number of learning episodes. The analysis consists of two parts: perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and parameter estimation error, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.
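As a rough illustration of the discrete-time least-squares idea described in the abstract, the following sketch estimates the drift coefficients of a scalar system $dX_t = (A X_t + B U_t)\,dt + dW_t$ from discretely observed increments under piecewise constant controls. It is not the authors' algorithm: the scalar setting, the fixed feedback gain, the added exploration noise in the control, the Euler-Maruyama simulator, and all function and variable names are assumptions made for this example, and the Riccati-based policy update between episodes is omitted.

```python
import numpy as np


def simulate_episode(a, b, gain, x0, horizon, n_steps, rng):
    """Euler-Maruyama path of dX = (a X + b U) dt + dW.

    The control is held constant on each time step (piecewise constant).
    Assumption for this sketch: U = gain * X + standard normal noise, where
    the noise is added only so that X and U are not collinear regressors.
    """
    dt = horizon / n_steps
    x = np.empty(n_steps + 1)
    u = np.empty(n_steps)
    x[0] = x0
    for i in range(n_steps):
        u[i] = gain * x[i] + rng.standard_normal()
        dw = np.sqrt(dt) * rng.standard_normal()
        x[i + 1] = x[i] + (a * x[i] + b * u[i]) * dt + dw
    return x, u


def least_squares_estimate(episodes, horizon):
    """Regress state increments on (X, U) * dt, pooled over all episodes.

    Solves (sum z z^T dt) theta = sum z dX with z = (X, U), which is the
    ordinary least-squares estimate of theta = (a, b).
    """
    gram = np.zeros((2, 2))
    cross = np.zeros(2)
    for x, u in episodes:
        dt = horizon / len(u)
        z = np.stack([x[:-1], u])  # shape (2, n_steps)
        gram += (z @ z.T) * dt
        cross += z @ np.diff(x)
    return np.linalg.solve(gram, cross)


# Hypothetical true parameters and experiment sizes, chosen for illustration.
rng = np.random.default_rng(0)
a_true, b_true = -0.5, 1.0
episodes = [
    simulate_episode(a_true, b_true, -1.0, 1.0, 1.0, 100, rng)
    for _ in range(500)
]
a_hat, b_hat = least_squares_estimate(episodes, 1.0)
print(f"a_hat={a_hat:.3f}, b_hat={b_hat:.3f}")
```

In this toy setup the estimation error shrinks as more episodes are pooled, which is the kind of behaviour the regret analysis builds on. The dependence on the time stepsize mentioned in the abstract does not appear here, because the data are generated by the same Euler scheme that the regression assumes.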
