{"title":"Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning","authors":"Gen Li;Laixi Shi;Yuxin Chen;Yuejie Chi","doi":"10.1093/imaiai/iaac034","DOIUrl":null,"url":null,"abstract":"Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with \n<tex>$S$</tex>\n states, \n<tex>$A$</tex>\n actions and horizon length \n<tex>$H$</tex>\n, substantial progress has been achieved toward characterizing the minimax-optimal regret, which scales on the order of \n<tex>$\\sqrt{H^2SAT}$</tex>\n (modulo log factors) with \n<tex>$T$</tex>\n the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. \n<tex>$S^6A^4 \\,\\mathrm{poly}(H)$</tex>\n for existing model-free methods).To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity \n<tex>$O(SAH)$</tex>\n, that achieves near-optimal regret as soon as the sample size exceeds the order of \n<tex>$SA\\,\\mathrm{poly}(H)$</tex>\n. In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves—by at least a factor of \n<tex>$S^5A^3$</tex>\n—upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.","PeriodicalId":45437,"journal":{"name":"Information and Inference-A Journal of the Ima","volume":"12 2","pages":"969-1043"},"PeriodicalIF":1.4000,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8016800/10058586/10058618.pdf","citationCount":"35","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Inference-A Journal of the Ima","FirstCategoryId":"100","ListUrlMain":"https://ieeexplore.ieee.org/document/10058618/","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
Citations: 35
Abstract
Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. For a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been made toward characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2 SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., $S^6 A^4\,\mathrm{poly}(H)$ for existing model-free methods). To overcome this large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves upon any prior memory-efficient algorithm that is asymptotically regret-optimal by at least a factor of $S^5 A^3$. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.
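To make the two algorithmic ingredients highlighted in the abstract concrete, below is a minimal tabular sketch in Python of (i) running optimistic (UCB) and pessimistic (LCB) Q-learning sequences side by side and (ii) freezing a per-state reference value once the two sequences agree, so that later updates can use a reference-advantage (variance-reduced) target. This is only a schematic illustration under assumed simplifications, not the paper's algorithm: the toy MDP, the Hoeffding-style bonus, the SETTLE_GAP threshold and all constants are assumptions made for illustration, whereas the paper relies on sharper Bernstein-type bonuses and a careful step-size analysis.

```python
import numpy as np

# Toy sizes and constants; everything here is an illustrative assumption.
S, A, H, K = 4, 2, 5, 3000
SETTLE_GAP = 1.0                                # freeze the reference once V_up - V_lo <= gap

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # random transition kernel of a toy MDP
R = rng.uniform(0.0, 1.0, size=(S, A))          # deterministic rewards in [0, 1]

Q_up = np.full((H, S, A), float(H))             # optimistic (UCB) Q-estimates
Q_lo = np.zeros((H, S, A))                      # pessimistic (LCB) Q-estimates
V_up = np.full((H + 1, S), float(H)); V_up[H] = 0.0
V_lo = np.zeros((H + 1, S))
V_ref = np.zeros((H + 1, S))                    # reference values, frozen once "settled"
settled = np.zeros((H + 1, S), dtype=bool)
mu_ref = np.zeros((H, S, A))                    # running mean of r + V_ref(s'), 1/n weights
mu_adv = np.zeros((H, S, A))                    # running estimate of the advantage V_up - V_ref
N = np.zeros((H, S, A), dtype=int)              # visit counts

def bonus(n: int) -> float:
    # Crude Hoeffding-style exploration bonus (the paper uses sharper Bernstein-type bonuses).
    return H * np.sqrt(2.0 * np.log(S * A * H * K) / n)

for k in range(K):
    s = rng.integers(S)                         # arbitrary initial state each episode
    for h in range(H):
        a = int(np.argmax(Q_up[h, s]))          # act greedily w.r.t. the optimistic estimate
        s_next = rng.choice(S, p=P[s, a])
        r = R[s, a]
        N[h, s, a] += 1
        n = N[h, s, a]
        eta = (H + 1) / (H + n)                 # the usual H/(H+n)-type step size

        # Reference-advantage decomposition of the target r + V(s'):
        #   r + V_ref(s')        averaged with 1/n weights  (low variance once V_ref is frozen)
        #   V_up(s') - V_ref(s') averaged with step size eta (small once V_up is nearly settled)
        mu_ref[h, s, a] += (r + V_ref[h + 1, s_next] - mu_ref[h, s, a]) / n
        mu_adv[h, s, a] = (1 - eta) * mu_adv[h, s, a] \
            + eta * (V_up[h + 1, s_next] - V_ref[h + 1, s_next])

        Q_up[h, s, a] = min(Q_up[h, s, a], mu_ref[h, s, a] + mu_adv[h, s, a] + bonus(n))
        Q_lo[h, s, a] = max(Q_lo[h, s, a],
                            (1 - eta) * Q_lo[h, s, a] + eta * (r + V_lo[h + 1, s_next] - bonus(n)))

        V_up[h, s] = Q_up[h, s].max()
        V_lo[h, s] = min(Q_lo[h, s].max(), V_up[h, s])

        # Early-settled reference: once the UCB and LCB value estimates agree at (h, s),
        # freeze V_ref[h, s] so that later updates reuse the same low-variance reference.
        if not settled[h, s] and V_up[h, s] - V_lo[h, s] <= SETTLE_GAP:
            V_ref[h, s] = V_up[h, s]
            settled[h, s] = True

        s = s_next

print("fraction of (h, s) pairs with a settled reference:", settled[:H].mean())
```

The intuition behind the decomposition is that the $r + V_{\mathrm{ref}}(s')$ part can be averaged over all $n$ visits once the reference is frozen early, while the residual advantage term is small by construction; per the abstract, this early-settled reference update is what drives the burn-in cost down to the order of $SA\,\mathrm{poly}(H)$.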