{"title":"Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning","authors":"Gen Li;Laixi Shi;Yuxin Chen;Yuejie Chi","doi":"10.1093/imaiai/iaac034","DOIUrl":null,"url":null,"abstract":"Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with \n<tex>$S$</tex>\n states, \n<tex>$A$</tex>\n actions and horizon length \n<tex>$H$</tex>\n, substantial progress has been achieved toward characterizing the minimax-optimal regret, which scales on the order of \n<tex>$\\sqrt{H^2SAT}$</tex>\n (modulo log factors) with \n<tex>$T$</tex>\n the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. \n<tex>$S^6A^4 \\,\\mathrm{poly}(H)$</tex>\n for existing model-free methods).To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity \n<tex>$O(SAH)$</tex>\n, that achieves near-optimal regret as soon as the sample size exceeds the order of \n<tex>$SA\\,\\mathrm{poly}(H)$</tex>\n. In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves—by at least a factor of \n<tex>$S^5A^3$</tex>\n—upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.","PeriodicalId":1,"journal":{"name":"Accounts of Chemical Research","volume":null,"pages":null},"PeriodicalIF":16.4000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8016800/10058586/10058618.pdf","citationCount":"35","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Accounts of Chemical Research","FirstCategoryId":"100","ListUrlMain":"https://ieeexplore.ieee.org/document/10058618/","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 35
Abstract
Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. For a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been made toward characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors), with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4\,\mathrm{poly}(H)$ for existing model-free methods). To overcome this large sample-size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample-size requirement (also referred to as the initial burn-in cost), our method improves, by at least a factor of $S^5A^3$, upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance-reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance-reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.
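To make the two ingredients named in the abstract concrete, the sketch below illustrates, in Python, a pair of Q-learning sequences maintaining upper and lower confidence bounds, together with a rule that "settles" a reference value for a state once the two bounds agree within a tolerance. This is only an illustrative sketch, not the paper's algorithm: the toy environment, Hoeffding-style bonus, step size, tolerance, and all variable names are assumptions, and the full reference-advantage variance-reduction update that would use the settled reference is omitted.

```python
# Minimal sketch (assumptions throughout): UCB/LCB Q-learning sequences plus an
# "early-settled" reference rule. Memory is O(SAH), matching the abstract's claim.
import numpy as np

S, A, H = 5, 3, 10          # states, actions, horizon (toy sizes, assumed)
K = 2000                    # number of episodes (assumed)
c_bonus = 1.0               # bonus scaling constant (assumed)
settle_tol = 1.0            # gap below which the reference is frozen (assumed)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # toy transition kernel
R = rng.uniform(0, 1, size=(H, S, A))           # toy mean rewards

Q_ucb = np.full((H + 1, S, A), float(H))        # optimistic Q, initialized at H
Q_ucb[H] = 0.0                                  # terminal values are zero
Q_lcb = np.zeros((H + 1, S, A))                 # pessimistic Q, initialized at 0
V_ref = np.zeros((H + 1, S))                    # settled reference values
settled = np.zeros((H + 1, S), dtype=bool)
N = np.zeros((H, S, A), dtype=int)              # visit counts

for k in range(K):
    s = int(rng.integers(S))
    for h in range(H):
        a = int(np.argmax(Q_ucb[h, s]))                 # act greedily w.r.t. UCB
        r = R[h, s, a]
        s_next = int(rng.choice(S, p=P[h, s, a]))

        N[h, s, a] += 1
        n = N[h, s, a]
        eta = (H + 1) / (H + n)                         # step size common in UCB-Q analyses
        bonus = c_bonus * np.sqrt(H**3 / n)             # Hoeffding-style bonus (assumed form)

        # Two Q-learning sequences: one optimistic (UCB), one pessimistic (LCB).
        v_ucb = Q_ucb[h + 1, s_next].max()
        v_lcb = Q_lcb[h + 1, s_next].max()
        Q_ucb[h, s, a] = min((1 - eta) * Q_ucb[h, s, a] + eta * (r + v_ucb + bonus), H)
        Q_lcb[h, s, a] = max((1 - eta) * Q_lcb[h, s, a] + eta * (r + v_lcb - bonus), 0.0)

        # "Early-settled" flavor: once the UCB and LCB value estimates are close,
        # freeze the reference value for this (h, state) and never update it again.
        # (The actual algorithm would then use V_ref inside a reference-advantage
        # variance-reduction update; that step is omitted in this sketch.)
        if not settled[h, s] and Q_ucb[h, s].max() - Q_lcb[h, s].max() <= settle_tol:
            V_ref[h, s] = Q_ucb[h, s].max()
            settled[h, s] = True

        s = s_next

print("fraction of (h, s) pairs with a settled reference:", settled[:H].mean())
```

The point of settling the reference early is that, once frozen, it no longer drifts with subsequent updates, which is what keeps the memory footprint at $O(SAH)$ while still allowing variance reduction; the specific tolerance and bonus used above are placeholders rather than the paper's choices.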