{"title":"非平稳线性动力系统控制的动态遗憾最小化","authors":"Yuwei Luo, Varun Gupta, M. Kolar","doi":"10.1145/3508029","DOIUrl":null,"url":null,"abstract":"We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon T with fixed and known cost matrices Q,R, but unknown and non-stationary dynamics A_t, B_t. The sequence of dynamics matrices can be arbitrary, but with a total variation, V_T, assumed to be o(T) and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all t, we present an algorithm that achieves the optimal dynamic regret of O(V_T^2/5 T^3/5 ). With piecewise constant dynamics, our algorithm achieves the optimal regret of O(sqrtST ) where S is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.","PeriodicalId":426760,"journal":{"name":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Dynamic Regret Minimization for Control of Non-stationary Linear Dynamical Systems\",\"authors\":\"Yuwei Luo, Varun Gupta, M. Kolar\",\"doi\":\"10.1145/3508029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon T with fixed and known cost matrices Q,R, but unknown and non-stationary dynamics A_t, B_t. The sequence of dynamics matrices can be arbitrary, but with a total variation, V_T, assumed to be o(T) and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all t, we present an algorithm that achieves the optimal dynamic regret of O(V_T^2/5 T^3/5 ). With piecewise constant dynamics, our algorithm achieves the optimal regret of O(sqrtST ) where S is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. 
Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.\",\"PeriodicalId\":426760,\"journal\":{\"name\":\"Proceedings of the ACM on Measurement and Analysis of Computing Systems\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACM on Measurement and Analysis of Computing Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3508029\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Measurement and Analysis of Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3508029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8
Abstract
We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon $T$ with fixed and known cost matrices $Q, R$, but unknown and non-stationary dynamics $A_t, B_t$. The sequence of dynamics matrices can be arbitrary, but with a total variation, $V_T$, assumed to be $o(T)$ and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal, controllers is available for all $t$, we present an algorithm that achieves the optimal dynamic regret of $O(V_T^{2/5} T^{3/5})$. With piecewise-constant dynamics, our algorithm achieves the optimal regret of $O(\sqrt{ST})$, where $S$ is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual multi-armed bandit problems. We also argue that non-adaptive forgetting (e.g., restarting, or sliding-window learning with a static window size) may not be regret-optimal for the LQR problem, even when the window size is optimally tuned with knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and a locally quadratic cost. This motif is more universal than the LQR problem itself, and we therefore believe our results should find wider application.
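Since only the abstract is available here, the following is a minimal sketch of the two ingredients it names: OLS identification of the dynamics $(A_t, B_t)$ from state-transition data, and a restart test that flags non-stationarity when one-step prediction residuals grow. The function names, the residual statistic, and the `threshold` parameter are illustrative assumptions, not the paper's calibrated detection procedure.

```python
import numpy as np

def ols_dynamics_estimate(states, inputs):
    """OLS fit of x_{t+1} ~ A x_t + B u_t.

    states: (T+1, n) array of observed states.
    inputs: (T, m) array of applied controls.
    Returns (A_hat, B_hat) of shapes (n, n) and (n, m).
    """
    X = np.hstack([states[:-1], inputs])  # regressors z_t = [x_t, u_t]
    Y = states[1:]                        # targets x_{t+1}
    # Solve min_Theta ||X Theta - Y||_F^2, so Theta.T = [A_hat, B_hat].
    Theta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    n = states.shape[1]
    return Theta.T[:, :n], Theta.T[:, n:]

def drift_detected(states, inputs, A_hat, B_hat, threshold):
    """Flag non-stationarity when the mean one-step prediction
    residual under the current (A_hat, B_hat) exceeds `threshold`
    (a hypothetical tuning parameter)."""
    preds = states[:-1] @ A_hat.T + inputs @ B_hat.T
    resid = np.linalg.norm(states[1:] - preds, axis=1)
    return resid.mean() > threshold

if __name__ == "__main__":
    # Toy usage: identify a fixed 2-state, 1-input system from noisy data.
    rng = np.random.default_rng(0)
    n, m, T = 2, 1, 200
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    B = np.array([[0.0], [1.0]])
    x = np.zeros((T + 1, n))
    u = rng.normal(size=(T, m))
    for t in range(T):
        x[t + 1] = A @ x[t] + B @ u[t] + 0.01 * rng.normal(size=n)
    A_hat, B_hat = ols_dynamics_estimate(x, u)
    print(np.round(A_hat, 2), np.round(B_hat, 2))
```

A fixed `threshold` as above is pure scaffolding; an adaptive scheme in the spirit of the paper would instead scale the test with the confidence width of the OLS estimate, so that restarts are triggered only when the observed drift is statistically distinguishable from estimation noise.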