Near-Optimal Regret Bounds for Thompson Sampling
Authors: Shipra Agrawal, Navin Goyal
Journal: Journal of the ACM (JACM), pages 1-24
Published: 2017-09-04
DOI: https://doi.org/10.1145/3088510
Citations: 112
Abstract
Thompson Sampling (TS) is one of the oldest heuristics for multiarmed bandit problems. It is a randomized algorithm based on Bayesian ideas and has recently generated significant interest after several studies demonstrated that it has favorable empirical performance compared to the state-of-the-art methods. In this article, a novel and almost tight martingale-based regret analysis for Thompson Sampling is presented. Our technique simultaneously yields both problem-dependent and problem-independent bounds: (1) the first near-optimal problem-independent bound of $O(\sqrt{NT \ln T})$ on the expected regret and (2) the optimal problem-dependent bound of $(1+\epsilon)\sum_i \frac{\ln T}{d(\mu_i,\mu_1)} + O(N/\epsilon^2)$ on the expected regret (this bound was first proven by Kaufmann et al. (2012b)). Our technique is conceptually simple and easily extends to distributions other than the Beta distribution used in the original TS algorithm. For the version of TS that uses Gaussian priors, we prove a problem-independent bound of $O(\sqrt{NT \ln N})$ on the expected regret and show the optimality of this bound by providing a matching lower bound. This is the first lower bound on the performance of a natural version of Thompson Sampling that is away from the general lower bound of $\Omega(\sqrt{NT})$ for the multiarmed bandit problem.
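For orientation, the Beta-prior variant mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes Bernoulli rewards, a uniform Beta(1, 1) prior on each arm, and ties broken by lowest index. The function name `thompson_sampling_bernoulli` and its parameters are hypothetical, and the returned regret is the pseudo-regret (the sum of mean gaps of the arms played), which is only computable here because the simulation knows the true means.

```python
import random


def thompson_sampling_bernoulli(means, horizon, seed=0):
    """Simulate Beta-prior Thompson Sampling on Bernoulli arms.

    `means` are the true success probabilities (known only to the simulator).
    Returns (pulls per arm, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms  # observed rewards of 1 per arm
    failures = [0] * n_arms   # observed rewards of 0 per arm
    pulls = [0] * n_arms
    best_mean = max(means)
    regret = 0.0

    for _ in range(horizon):
        # Assumption: Beta(1, 1) prior, so the posterior for arm i
        # is Beta(successes[i] + 1, failures[i] + 1).
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Play the arm whose posterior sample is largest.
        arm = max(range(n_arms), key=samples.__getitem__)

        # Draw a Bernoulli reward from the chosen arm and update its counts.
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

        # Pseudo-regret: gap between the best mean and the played arm's mean.
        regret += best_mean - means[arm]

    return pulls, regret


if __name__ == "__main__":
    pulls, regret = thompson_sampling_bernoulli([0.1, 0.5, 0.9], horizon=5000)
    print("pulls per arm:", pulls)
    print("cumulative pseudo-regret:", round(regret, 2))
```

Averaging the returned regret over many seeds gives an empirical estimate of the expected regret that the bounds above refer to, with N the number of arms and T the horizon.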