J. S. van Hulst, W. P. M. H. Heemels, D. J. Antunes
{"title":"Data-Efficient Quadratic Q-Learning Using LMIs","authors":"J. S. van Hulst, W. P. M. H. Heemels, D. J. Antunes","doi":"arxiv-2409.11986","DOIUrl":null,"url":null,"abstract":"Reinforcement learning (RL) has seen significant research and application\nresults but often requires large amounts of training data. This paper proposes\ntwo data-efficient off-policy RL methods that use parametrized Q-learning. In\nthese methods, the Q-function is chosen to be linear in the parameters and\nquadratic in selected basis functions in the state and control deviations from\na base policy. A cost penalizing the $\\ell_1$-norm of Bellman errors is\nminimized. We propose two methods: Linear Matrix Inequality Q-Learning (LMI-QL)\nand its iterative variant (LMI-QLi), which solve the resulting episodic\noptimization problem through convex optimization. LMI-QL relies on a convex\nrelaxation that yields a semidefinite programming (SDP) problem with linear\nmatrix inequalities (LMIs). LMI-QLi entails solving sequential iterations of an\nSDP problem. Both methods combine convex optimization with direct Q-function\nlearning, significantly improving learning speed. A numerical case study\ndemonstrates their advantages over existing parametrized Q-learning methods.","PeriodicalId":501175,"journal":{"name":"arXiv - EE - Systems and Control","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Systems and Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11986","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Reinforcement learning (RL) has seen significant research and application results but often requires large amounts of training data. This paper proposes two data-efficient off-policy RL methods that use parametrized Q-learning. In these methods, the Q-function is chosen to be linear in the parameters and quadratic in selected basis functions of the state and control deviations from a base policy. A cost penalizing the $\ell_1$-norm of Bellman errors is minimized. We propose two methods: Linear Matrix Inequality Q-Learning (LMI-QL) and its iterative variant (LMI-QLi), which solve the resulting episodic optimization problem through convex optimization. LMI-QL relies on a convex relaxation that yields a semidefinite programming (SDP) problem with linear matrix inequalities (LMIs). LMI-QLi entails solving sequential iterations of an SDP problem. Both methods combine convex optimization with direct Q-function learning, significantly improving learning speed. A numerical case study demonstrates their advantages over existing parametrized Q-learning methods.
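To make the ingredients named in the abstract concrete, the sketch below shows a simplified, illustrative setup: a Q-function that is quadratic in a feature vector $z = [x; u]$ and linear in a symmetric parameter matrix $P$, fit by minimizing the $\ell_1$-norm of Bellman errors under a semidefiniteness (LMI) constraint, solved as an SDP with cvxpy. This is not the paper's LMI-QL relaxation: to keep the problem convex in $P$, the minimization over the next control is replaced here by the next control actually taken by a base policy (a SARSA-style simplification), and all data, dimensions, and constants are assumed purely for illustration.

```python
# Illustrative sketch only (not the authors' LMI-QL/LMI-QLi algorithms):
# fit a quadratic-in-features, linear-in-parameters Q-function by minimizing
# the l1-norm of simplified Bellman errors subject to an LMI constraint.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n_x, n_u, N, gamma = 2, 1, 50, 0.95   # assumed toy dimensions and discount

# Synthetic transitions (x, u, cost, x_next, u_next) under some base policy.
X = rng.standard_normal((N, n_x))
U = rng.standard_normal((N, n_u))
C = np.sum(X**2, axis=1) + np.sum(U**2, axis=1)        # quadratic stage cost
Xn = 0.9 * X + 0.1 * rng.standard_normal((N, n_x))     # next states
Un = -0.5 * Xn[:, :n_u]                                # next controls from base policy

d = n_x + n_u
P = cp.Variable((d, d), symmetric=True)                # Q(x, u) = z' P z with z = [x; u]

def q_value(x, u):
    # Affine in P because z is a constant data vector.
    z = np.concatenate([x, u])
    return cp.quad_form(z, P)

# Bellman errors Q(x,u) - (c + gamma * Q(x', u')), with the min over u'
# replaced by the observed next control (simplification, see lead-in).
bellman_errors = cp.hstack(
    [q_value(X[i], U[i]) - (C[i] + gamma * q_value(Xn[i], Un[i])) for i in range(N)]
)

objective = cp.Minimize(cp.norm1(bellman_errors))
constraints = [P >> 1e-6 * np.eye(d)]                  # LMI: keep Q jointly convex in (x, u)
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.SCS)

print("Learned parameter matrix P:\n", P.value)
```

Because the Bellman errors are affine in $P$ under this simplification, the $\ell_1$-norm objective together with the LMI constraint gives a convex SDP, which is the structural property the paper exploits (via its own relaxation) to obtain data-efficient, episodic Q-function updates.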