{"title":"Restless bandits that hide their hand and recommendation systems","authors":"R. Meshram, Aditya Gopalan, D. Manjunath","doi":"10.1109/COMSNETS.2017.7945378","DOIUrl":null,"url":null,"abstract":"We consider a restless multi-armed bandit (RMAB) in which each arm can be in one of two states, say 0 or 1. Playing the arm brings it to state 0 with probability one and not playing it induces state transitions with arm-dependent probabilities. Playing an arm generates a unit reward with a probability that depends on the state of the arm. The belief about the state of the arm can be calculated using a Bayesian update after every play. This RMAB has been designed for use in recommendation systems which in turn can be used in applications like creating of playlists or placement of advertisements. In this paper we analyse the RMAB by first showing that it is Whittle-indexable and then obtain a closed form expression for the Whittle index for each arm calculated from the belief about its state and the parameters that describe the arm. For an RMAB to be useful in practice, we need to be able to learn the parameters of the arms. We present an algorithm derived from Thompson sampling scheme, that learns the parameters of the arms and also evaluate its performance numerically.","PeriodicalId":168357,"journal":{"name":"2017 9th International Conference on Communication Systems and Networks (COMSNETS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 9th International Conference on Communication Systems and Networks (COMSNETS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/COMSNETS.2017.7945378","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8
Abstract
We consider a restless multi-armed bandit (RMAB) in which each arm can be in one of two states, say 0 or 1. Playing an arm brings it to state 0 with probability one, while not playing it induces state transitions with arm-dependent probabilities. Playing an arm generates a unit reward with a probability that depends on the state of the arm. The belief about the state of an arm can be calculated using a Bayesian update after every play. This RMAB is designed for use in recommendation systems, which in turn can be used in applications such as the creation of playlists or the placement of advertisements. In this paper we analyse the RMAB by first showing that it is Whittle-indexable, and we then obtain a closed-form expression for the Whittle index of each arm, calculated from the belief about its state and the parameters that describe the arm. For an RMAB to be useful in practice, we need to be able to learn the parameters of the arms. We present an algorithm, derived from the Thompson sampling scheme, that learns the parameters of the arms, and we evaluate its performance numerically.
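To make the model concrete, below is a minimal simulation sketch of the dynamics the abstract describes: two hidden states per arm, a deterministic reset to state 0 on play, arm-dependent passive transitions otherwise, and Bernoulli rewards whose success probability depends on the hidden state. The number of arms, the transition probabilities p01/p11, the reward probabilities rho, the Beta priors, and the myopic Thompson-sampling selection rule are all illustrative assumptions; the paper's actual policy uses the closed-form Whittle index derived in the full text.

```python
# Minimal sketch (NOT the paper's exact algorithm) of the two-state hidden
# restless bandit from the abstract. All numerical values and the selection
# rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

K = 3                                   # number of arms (illustrative)
# Assumed passive dynamics per arm: p01[k] = P(0 -> 1), p11[k] = P(1 -> 1)
# when arm k is NOT played; the paper leaves these arm-dependent.
p01 = np.array([0.2, 0.4, 0.6])
p11 = np.array([0.7, 0.8, 0.9])
# rho[k, s] = P(unit reward | arm k played while in state s) -- illustrative.
rho = np.array([[0.1, 0.9],
                [0.2, 0.8],
                [0.3, 0.7]])

state = np.zeros(K, dtype=int)          # hidden states, all start in state 0
belief = np.zeros(K)                    # belief[k] = P(state[k] == 1)

# Beta posteriors over each rho[k, s]: a Thompson-sampling-style learner
# standing in for the paper's scheme.
alpha = np.ones((K, 2))
beta = np.ones((K, 2))

total_reward = 0.0
for t in range(10_000):
    # Thompson step: sample reward probabilities from the posterior and play
    # the arm with the largest sampled expected immediate reward under the
    # current belief (a myopic proxy for the paper's Whittle-index policy).
    rho_hat = rng.beta(alpha, beta)
    scores = (1.0 - belief) * rho_hat[:, 0] + belief * rho_hat[:, 1]
    a = int(np.argmax(scores))

    # Play arm a: the Bernoulli reward depends on its hidden state.
    reward = float(rng.random() < rho[a, state[a]])
    total_reward += reward

    # Bayesian update: posterior probability that the arm was in state 1
    # when played, using posterior-mean reward probabilities as plug-ins.
    r = alpha[a] / (alpha[a] + beta[a])          # estimates of [rho0, rho1]
    like = r if reward else 1.0 - r              # likelihood of the observation
    post1 = belief[a] * like[1] / (
        belief[a] * like[1] + (1.0 - belief[a]) * like[0] + 1e-12)

    # Credit the observation to both states with fractional counts
    # (a crude approximation; the paper's exact update is in the full text).
    alpha[a] += np.array([1.0 - post1, post1]) * reward
    beta[a] += np.array([1.0 - post1, post1]) * (1.0 - reward)

    # Playing resets the arm to state 0 with probability one ...
    state[a], belief[a] = 0, 0.0

    # ... while every arm that was not played transitions passively, and its
    # belief evolves by the same (known) Markov dynamics.
    for k in range(K):
        if k != a:
            state[k] = int(rng.random() < (p11[k] if state[k] else p01[k]))
            belief[k] = (1.0 - belief[k]) * p01[k] + belief[k] * p11[k]

print(f"average reward: {total_reward / 10_000:.3f}")
```

The myopic score used here ignores the future value of information, which is exactly what the Whittle index accounts for; the sketch is meant only to show where the belief update and the learned reward parameters enter the decision.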