{"title":"带着凹悬赏和凸背包的强盗","authors":"Shipra Agrawal, Nikhil R. Devanur","doi":"10.1145/2600057.2602844","DOIUrl":null,"url":null,"abstract":"In this paper, we consider a very general model for exploration-exploitation tradeoff which allows arbitrary concave rewards and convex constraints on the decisions across time, in addition to the customary limitation on the time horizon. This model subsumes the classic multi-armed bandit (MAB) model, and the Bandits with Knapsacks (BwK) model of Badanidiyuru et al.[2013]. We also consider an extension of this model to allow linear contexts, similar to the linear contextual extension of the MAB model. We demonstrate that a natural and simple extension of the UCB family of algorithms for MAB provides a polynomial time algorithm that has near-optimal regret guarantees for this substantially more general model, and matches the bounds provided by Badanidiyuru et al.[2013] for the special case of BwK, which is quite surprising. We also provide computationally more efficient algorithms by establishing interesting connections between this problem and other well studied problems/algorithms such as the Blackwell approachability problem, online convex optimization, and the Frank-Wolfe technique for convex optimization. We give examples of several concrete applications, where this more general model of bandits allows for richer and/or more efficient formulations of the problem.","PeriodicalId":203155,"journal":{"name":"Proceedings of the fifteenth ACM conference on Economics and computation","volume":"162 4","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"170","resultStr":"{\"title\":\"Bandits with concave rewards and convex knapsacks\",\"authors\":\"Shipra Agrawal, Nikhil R. Devanur\",\"doi\":\"10.1145/2600057.2602844\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we consider a very general model for exploration-exploitation tradeoff which allows arbitrary concave rewards and convex constraints on the decisions across time, in addition to the customary limitation on the time horizon. This model subsumes the classic multi-armed bandit (MAB) model, and the Bandits with Knapsacks (BwK) model of Badanidiyuru et al.[2013]. We also consider an extension of this model to allow linear contexts, similar to the linear contextual extension of the MAB model. We demonstrate that a natural and simple extension of the UCB family of algorithms for MAB provides a polynomial time algorithm that has near-optimal regret guarantees for this substantially more general model, and matches the bounds provided by Badanidiyuru et al.[2013] for the special case of BwK, which is quite surprising. We also provide computationally more efficient algorithms by establishing interesting connections between this problem and other well studied problems/algorithms such as the Blackwell approachability problem, online convex optimization, and the Frank-Wolfe technique for convex optimization. 
We give examples of several concrete applications, where this more general model of bandits allows for richer and/or more efficient formulations of the problem.\",\"PeriodicalId\":203155,\"journal\":{\"name\":\"Proceedings of the fifteenth ACM conference on Economics and computation\",\"volume\":\"162 4\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-02-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"170\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the fifteenth ACM conference on Economics and computation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2600057.2602844\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the fifteenth ACM conference on Economics and computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2600057.2602844","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 170
Abstract
In this paper, we consider a very general model for the exploration-exploitation tradeoff that allows arbitrary concave rewards and convex constraints on the decisions across time, in addition to the customary limitation on the time horizon. This model subsumes the classic multi-armed bandit (MAB) model and the Bandits with Knapsacks (BwK) model of Badanidiyuru et al. [2013]. We also consider an extension of this model that allows linear contexts, similar to the linear contextual extension of the MAB model. We demonstrate that a natural and simple extension of the UCB family of algorithms for MAB yields a polynomial-time algorithm with near-optimal regret guarantees for this substantially more general model, and that it matches the bounds given by Badanidiyuru et al. [2013] for the special case of BwK, which is quite surprising. We also provide computationally more efficient algorithms by establishing interesting connections between this problem and other well-studied problems and algorithms, such as the Blackwell approachability problem, online convex optimization, and the Frank-Wolfe technique for convex optimization. We give examples of several concrete applications where this more general model of bandits allows for richer and/or more efficient formulations of the problem.
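As background for the "natural and simple extension of the UCB family" mentioned above, the following is a minimal sketch of the classic UCB1 index policy for the unconstrained MAB problem. It is not the paper's algorithm: the function name ucb1, the Bernoulli arm means, and the horizon are illustrative assumptions, and the paper's extension additionally folds the concave reward objective and convex knapsack constraints into the optimistic decision, which this sketch does not attempt.

# A minimal sketch of the classic UCB1 policy for the multi-armed bandit
# problem, shown only as the building block that Agrawal and Devanur's
# extension generalizes. Arm means below are made-up illustrative values.
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 for `horizon` rounds on Bernoulli arms with the given means.

    Each arm's index is its empirical mean plus the confidence radius
    sqrt(2 ln t / n_i), so under-explored arms keep getting tried until
    their upper confidence bound drops below the current leader's.
    Returns the total reward collected.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # n_i: number of times arm i has been pulled
    sums = [0.0] * k      # cumulative reward collected from arm i
    total = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Play each arm once to initialize its estimate.
            arm = t - 1
        else:
            # Pick the arm maximizing the upper confidence bound.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

if __name__ == "__main__":
    # Three hypothetical arms; UCB1 should concentrate on the 0.7 arm.
    print(ucb1([0.3, 0.5, 0.7], horizon=10_000))

Roughly speaking, the paper keeps this optimism-under-uncertainty principle but applies it to vector-valued outcomes, replacing the scalar per-arm index with optimistic estimates that are optimized against the concave reward and convex constraints; the details are in the paper itself.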