Parametrized stochastic multi-armed bandits with binary rewards

Proceedings of the 2011 American Control Conference. Pub Date: 2011-08-18. DOI: 10.1109/ACC.2011.5991289

Chong Jiang, R. Srikant

Citations: 3

Abstract

In this paper, we consider the multi-armed bandit problem with a large number of correlated arms. We assume that the arms have Bernoulli-distributed rewards, independent across time, where the probabilities of success are parametrized by a known attribute vector for each arm and a single unknown preference vector, each of dimension n. For this model, we seek an algorithm whose total regret is sub-linear in time and independent of the number of arms. We present such an algorithm, which we call the Three-phase Algorithm, and analyze its performance. We show an upper bound on the total regret that holds uniformly in time. The asymptotics of this bound show that for any f ∈ ω(log(T)), the total regret can be made to be O(n·f(T)), independent of the number of arms.
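To make the reward model and the notion of total regret concrete, here is a minimal Python sketch. It is not the Three-phase Algorithm. The abstract says only that the success probabilities are parametrized by the attribute and preference vectors, so the inner-product form used below is an assumption, and all names and numeric values (`success_probs`, `expected_regret`, `attributes`, `preference`) are hypothetical.

```python
import random

def success_probs(attributes, preference):
    # ASSUMPTION: success probability is the inner product <u_i, z>,
    # clipped to [0, 1]. The abstract does not state the exact form.
    return [
        min(1.0, max(0.0, sum(a * z for a, z in zip(u, preference))))
        for u in attributes
    ]

def pull(probs, arm, rng):
    # Bernoulli reward, drawn independently at each time step.
    return 1 if rng.random() < probs[arm] else 0

def expected_regret(probs, pulls):
    # Total expected regret: per-step gap to the best arm, summed over time.
    best = max(probs)
    return sum(best - probs[arm] for arm in pulls)

rng = random.Random(0)

# Hypothetical instance with n = 2 and four arms.
attributes = [[0.9, 0.0], [0.0, 0.9], [0.5, 0.5], [0.2, 0.6]]  # known
preference = [0.8, 0.3]  # unknown to the learner in the real problem
probs = success_probs(attributes, preference)

T = 1000
# Baseline policy: pick arms uniformly at random (regret grows linearly in T).
uniform_pulls = [rng.randrange(len(probs)) for _ in range(T)]
rewards = [pull(probs, arm, rng) for arm in uniform_pulls]

regret_uniform = expected_regret(probs, uniform_pulls)
# An oracle that knows the preference vector always plays the best arm.
regret_oracle = expected_regret(probs, [probs.index(max(probs))] * T)
```

Because every arm's success probability depends on the same n-dimensional preference vector, observations from a few arms carry information about all of them, which is why a regret bound independent of the number of arms is plausible.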


