Parametrized stochastic multi-armed bandits with binary rewards

Chong Jiang, R. Srikant
Proceedings of the 2011 American Control Conference. Published 2011-08-18. DOI: 10.1109/ACC.2011.5991289 (https://doi.org/10.1109/ACC.2011.5991289)
Citations: 3

Abstract

In this paper, we consider the problem of multi-armed bandits with a large number of correlated arms. We assume that the arms have Bernoulli-distributed rewards, independent across time, where the probabilities of success are parametrized by a known attribute vector for each arm and an unknown preference vector, each of dimension n. For this model, we seek an algorithm whose total regret is sub-linear in time and independent of the number of arms. We present such an algorithm, which we call the Three-phase Algorithm, and analyze its performance. We show an upper bound on the total regret that holds uniformly in time. The asymptotics of this bound show that for any f ∈ ω(log(T)), the total regret can be made to be O(n·f(T)), independent of the number of arms.
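
The model in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's Three-phase Algorithm, which the abstract does not describe in detail. It assumes, purely for illustration, that an arm's success probability is the inner product of its attribute vector with the preference vector, clipped to [0, 1]; the abstract says only that the probabilities are parametrized by these vectors. The names success_prob, pull, and expected_regret are hypothetical.

```python
import random


def success_prob(attr, pref):
    """Success probability of an arm.

    ASSUMPTION: inner product of the attribute and preference vectors,
    clipped to [0, 1]. The abstract does not state the exact form.
    """
    p = sum(a * b for a, b in zip(attr, pref))
    return min(1.0, max(0.0, p))


def pull(attr, pref, rng):
    """Draw one Bernoulli reward (0 or 1) for the arm with attributes `attr`."""
    return 1 if rng.random() < success_prob(attr, pref) else 0


def expected_regret(choices, arms, pref):
    """Total expected regret of a sequence of arm indices.

    Each pull is charged the gap between the best arm's success
    probability and that of the arm actually chosen.
    """
    probs = [success_prob(a, pref) for a in arms]
    best = max(probs)
    return sum(best - probs[i] for i in choices)


if __name__ == "__main__":
    rng = random.Random(0)
    n, num_arms, horizon = 3, 1000, 500

    # Entries are scaled by 1/n so every inner product stays within [0, 1].
    arms = [[rng.random() / n for _ in range(n)] for _ in range(num_arms)]
    pref = [rng.random() for _ in range(n)]  # unknown to the learner

    # Baseline policy: pick arms uniformly at random and ignore the rewards.
    choices = [rng.randrange(num_arms) for _ in range(horizon)]
    rewards = [pull(arms[i], pref, rng) for i in choices]

    print("total reward:", sum(rewards))
    print("expected regret:", expected_regret(choices, arms, pref))
```

The uniform-random baseline never learns, so its expected regret grows linearly in the horizon. The point of the shared preference vector is that rewards from one arm carry information about every other arm, so there are only n unknowns to estimate however many arms exist.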