Sample complexity of variance-reduced policy gradient: weaker assumptions and lower bounds
Gabor Paczolay, Matteo Papini, Alberto Maria Metelli, Istvan Harmati, Marcello Restelli
Machine Learning, published 2024-06-27. DOI: 10.1007/s10994-024-06573-4
Abstract
Several variance-reduced versions of REINFORCE based on importance sampling achieve an improved \(O(\epsilon ^{-3})\) sample complexity to find an \(\epsilon\)-stationary point, but only under an unrealistic assumption on the variance of the importance weights. In this paper, we propose the Defensive Policy Gradient (DEF-PG) algorithm, based on defensive importance sampling, which achieves the same sample complexity without any assumption on the variance of the importance weights. We also establish a matching \(\Omega (\epsilon ^{-3})\) lower bound, showing that this rate cannot be improved, and prove that REINFORCE, with its \(O(\epsilon ^{-4})\) sample complexity, is actually optimal under weaker assumptions on the policy class. Numerical simulations show promising results for the proposed technique compared to similar algorithms based on vanilla importance sampling.
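The key mechanism referenced in the abstract, defensive importance sampling, weights samples drawn from a mixture of the target and behavior distributions rather than from the behavior distribution alone, which caps every importance weight at \(1/\alpha\) for mixture coefficient \(\alpha\). Below is a minimal sketch of that bound for discrete distributions, not the paper's DEF-PG algorithm itself; the function name `defensive_weights` and the choice \(\alpha = 0.5\) are illustrative assumptions.

```python
import numpy as np

def defensive_weights(p_target, p_behavior, alpha=0.5):
    """Defensive importance weights w = p_target / q_mix, assuming samples
    are drawn from the mixture q_mix = alpha * p_target + (1 - alpha) * p_behavior.
    Since q_mix >= alpha * p_target, every weight satisfies w <= 1 / alpha,
    so the weight variance is bounded regardless of how far apart the
    target and behavior distributions are."""
    q_mix = alpha * p_target + (1.0 - alpha) * p_behavior
    return p_target / q_mix

# Example: two very different discrete distributions (hypothetical numbers).
p_target = np.array([0.98, 0.01, 0.01])
p_behavior = np.array([0.01, 0.01, 0.98])

w = defensive_weights(p_target, p_behavior, alpha=0.5)
print(w)                              # each weight is at most 1 / alpha = 2
assert np.all(w <= 1.0 / 0.5 + 1e-12)
```

For comparison, the vanilla importance weight for the first action here would be \(0.98 / 0.01 = 98\), and such ratios can blow up the variance; the mixture caps every weight at \(1/\alpha = 2\), which is the property that removes the variance assumption discussed in the abstract.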
About the journal
Machine Learning serves as a global platform for research on computational approaches to learning. The journal reports substantial findings on diverse learning methods applied to a variety of problems, supported by empirical studies, theoretical analysis, or connections to psychological phenomena. It showcases applications of learning methods to significant problems and aims to improve the conduct of machine learning research by emphasizing verifiable and replicable evidence in published papers.