Dynamic Regret Minimization for Control of Non-stationary Linear Dynamical Systems

Proceedings of the ACM on Measurement and Analysis of Computing Systems. Pub Date: 2021-11-06. DOI: 10.1145/3508029

Yuwei Luo, Varun Gupta, M. Kolar


Citations: 8

Abstract

We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon $T$ with fixed and known cost matrices $Q, R$, but unknown and non-stationary dynamics $A_t, B_t$. The sequence of dynamics matrices can be arbitrary, but with a total variation, $V_T$, assumed to be $o(T)$ and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all $t$, we present an algorithm that achieves the optimal dynamic regret of $O(V_T^{2/5} T^{3/5})$. With piecewise constant dynamics, our algorithm achieves the optimal regret of $O(\sqrt{ST})$, where $S$ is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual multi-armed bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret-optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.
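To make the role of non-stationarity in OLS estimation concrete, the following is a minimal sketch, not the paper's algorithm. It simulates a small system $x_{t+1} = A_t x_t + B_t u_t + w_t$ whose dynamics switch once, then compares an OLS fit on all data with a fit on post-switch data only. Every name and number here (`simulate`, `ols_estimate`, `compare_errors`, the matrices, the noise levels, the zero gain `K`) is an illustrative assumption; the switch time is given to the estimator, whereas the paper's setting requires detecting it adaptively.

```python
import numpy as np


def simulate(A_seq, B_seq, K, noise_std=0.1, explore_std=0.5, seed=0):
    """Roll out x_{t+1} = A_t x_t + B_t u_t + w_t with u_t = K x_t + exploration noise."""
    rng = np.random.default_rng(seed)
    n, m = B_seq[0].shape
    x = np.zeros(n)
    X, U, Y = [], [], []
    for A_t, B_t in zip(A_seq, B_seq):
        u = K @ x + explore_std * rng.standard_normal(m)
        x_next = A_t @ x + B_t @ u + noise_std * rng.standard_normal(n)
        X.append(x)
        U.append(u)
        Y.append(x_next)
        x = x_next
    return np.array(X), np.array(U), np.array(Y)


def ols_estimate(X, U, Y):
    """Least-squares fit of [A B] from x_{t+1} ~ [A B] [x_t; u_t]."""
    Z = np.hstack([X, U])                           # regressors, shape (N, n + m)
    theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # shape (n + m, n)
    return theta.T                                  # shape (n, n + m)


def compare_errors(T=2000):
    """Return (error using all data, error using post-switch data only)."""
    # Hypothetical stable dynamics before and after a single switch.
    A1 = np.array([[0.5, 0.1], [0.0, 0.5]])
    A2 = np.array([[-0.5, 0.1], [0.0, 0.3]])
    B = np.array([[0.0], [1.0]])
    K = np.zeros((1, 2))  # both regimes are already stable, so a zero gain suffices here

    switch = T // 2
    A_seq = [A1] * switch + [A2] * (T - switch)
    B_seq = [B] * T
    X, U, Y = simulate(A_seq, B_seq, K)

    current = np.hstack([A2, B])  # parameters in force at the end of the horizon
    est_full = ols_estimate(X, U, Y)
    est_recent = ols_estimate(X[switch:], U[switch:], Y[switch:])
    err_full = np.linalg.norm(est_full - current)
    err_recent = np.linalg.norm(est_recent - current)
    return err_full, err_recent


if __name__ == "__main__":
    err_full, err_recent = compare_errors()
    print(f"OLS error, all data:         {err_full:.3f}")
    print(f"OLS error, post-switch data: {err_recent:.3f}")
```

In this toy setup the fit that mixes pre-switch and post-switch samples is pulled toward the stale dynamics, while the fit restricted to recent samples is not. It illustrates why some form of forgetting is needed, but says nothing about how to choose the forgetting schedule, which is the question the abstract addresses.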

