Incentivizing exploration

P. Frazier, D. Kempe, J. Kleinberg, Robert D. Kleinberg
{"title":"激励探索","authors":"P. Frazier, D. Kempe, J. Kleinberg, Robert D. Kleinberg","doi":"10.1145/2600057.2602897","DOIUrl":null,"url":null,"abstract":"We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, when the arms are actually pulled by selfish and myopic individuals. Since such individuals pull the arm with highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among others, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research. We explore the tradeoff between the principal's total expected time-discounted incentive payments, and the total time-discounted rewards realized. Specifically, with a time-discount factor γ ∈ (0,1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly in a MAB problem, without having to incentivize selfish agents. We call a pair (ρ,b) ∈ [0,1]2 consisting of a reward ρ and payment b achievable if for every MAB instance, using expected time-discounted payments of at most b•OPT, the principal can guarantee an expected time-discounted reward of at least ρ•OPT. Our main result is an essentially complete characterization of achievable (payment, reward) pairs: if √b+√1-ρ>√γ, then (ρ,b) is achievable, and if √b+√1-ρ<√γ, then (ρ,b) is not achievable. In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p, and incentivize them to choose \"optimally\" with probability 1-p. The analysis of time-expanded policies leads to a question that may be of independent interest: If the same MAB instance (without selfish agents) is considered under two different time-discount rates γ > η, how small can the ratio of OPTη to OPTγ be? We give a complete answer to this question, showing that OPTη ≥ (1-γ)2/(1-η)2 • OPTγ, and that this bound is tight.","PeriodicalId":203155,"journal":{"name":"Proceedings of the fifteenth ACM conference on Economics and computation","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"108","resultStr":"{\"title\":\"Incentivizing exploration\",\"authors\":\"P. Frazier, D. Kempe, J. Kleinberg, Robert D. Kleinberg\",\"doi\":\"10.1145/2600057.2602897\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, when the arms are actually pulled by selfish and myopic individuals. Since such individuals pull the arm with highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among others, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research. We explore the tradeoff between the principal's total expected time-discounted incentive payments, and the total time-discounted rewards realized. 
Specifically, with a time-discount factor γ ∈ (0,1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly in a MAB problem, without having to incentivize selfish agents. We call a pair (ρ,b) ∈ [0,1]2 consisting of a reward ρ and payment b achievable if for every MAB instance, using expected time-discounted payments of at most b•OPT, the principal can guarantee an expected time-discounted reward of at least ρ•OPT. Our main result is an essentially complete characterization of achievable (payment, reward) pairs: if √b+√1-ρ>√γ, then (ρ,b) is achievable, and if √b+√1-ρ<√γ, then (ρ,b) is not achievable. In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p, and incentivize them to choose \\\"optimally\\\" with probability 1-p. The analysis of time-expanded policies leads to a question that may be of independent interest: If the same MAB instance (without selfish agents) is considered under two different time-discount rates γ > η, how small can the ratio of OPTη to OPTγ be? We give a complete answer to this question, showing that OPTη ≥ (1-γ)2/(1-η)2 • OPTγ, and that this bound is tight.\",\"PeriodicalId\":203155,\"journal\":{\"name\":\"Proceedings of the fifteenth ACM conference on Economics and computation\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"108\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the fifteenth ACM conference on Economics and computation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2600057.2602897\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the fifteenth ACM conference on Economics and computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2600057.2602897","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 108

Abstract

We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, when the arms are actually pulled by selfish and myopic individuals. Since such individuals pull the arm with the highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among others, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research. We explore the tradeoff between the principal's total expected time-discounted incentive payments and the total time-discounted rewards realized. Specifically, with a time-discount factor γ ∈ (0,1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly in a MAB problem, without having to incentivize selfish agents. We call a pair (ρ,b) ∈ [0,1]², consisting of a reward ρ and payment b, achievable if for every MAB instance, using expected time-discounted payments of at most b•OPT, the principal can guarantee an expected time-discounted reward of at least ρ•OPT. Our main result is an essentially complete characterization of achievable (payment, reward) pairs: if √b + √(1−ρ) > √γ, then (ρ,b) is achievable, and if √b + √(1−ρ) < √γ, then (ρ,b) is not achievable. In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p, and incentivize them to choose "optimally" with probability 1−p. The analysis of time-expanded policies leads to a question that may be of independent interest: if the same MAB instance (without selfish agents) is considered under two different time-discount rates γ > η, how small can the ratio of OPT_η to OPT_γ be? We give a complete answer to this question, showing that OPT_η ≥ ((1−γ)/(1−η))² • OPT_γ, and that this bound is tight.
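
Because the frontier √b + √(1−ρ) = √γ is in closed form, it is easy to check numerically which (payment, reward) pairs are achievable. Below is a minimal sketch in Python; the function name and the example numbers are ours, chosen only for illustration.

import math

def is_achievable(rho: float, b: float, gamma: float) -> bool:
    """Achievability test from the characterization: a pair (rho, b) in [0,1]^2
    is achievable when sqrt(b) + sqrt(1 - rho) exceeds sqrt(gamma), and not
    achievable when the sum falls strictly below sqrt(gamma)."""
    return math.sqrt(b) + math.sqrt(1.0 - rho) > math.sqrt(gamma)

# Example: with gamma = 0.9 and a payment budget of b = 0.1 * OPT, the frontier
# sits at rho = 1 - (sqrt(0.9) - sqrt(0.1))^2, which is approximately 0.600.
gamma, b = 0.9, 0.1
rho_frontier = 1.0 - (math.sqrt(gamma) - math.sqrt(b)) ** 2
print(f"frontier reward fraction: {rho_frontier:.3f}")
print(is_achievable(0.55, b, gamma))  # True: strictly inside the frontier
print(is_achievable(0.65, b, gamma))  # False: strictly outside the frontier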
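
The time-expanded idea can also be illustrated with a toy simulation: at each step the selfish agent picks the arm with the highest posterior mean with probability p, and with probability 1−p the principal pays to steer the pull. Computing the truly optimal Bayesian policy requires Gittins indices, so the sketch below substitutes Thompson sampling as a stand-in for the principal's intended arm; that substitution is our assumption for illustration, not the policy analyzed in the paper.

import random

def simulate_time_expanded(true_means, gamma=0.95, p=0.7, horizon=200, seed=0):
    """Two-armed Bernoulli bandit with Beta(1,1) priors. With probability p the
    myopic agent exploits (highest posterior mean); with probability 1 - p the
    principal enforces its intended arm (here a Thompson sample, standing in
    for the optimal policy). Returns the realized time-discounted reward."""
    rng = random.Random(seed)
    alpha = [1, 1]  # posterior success counts + 1, per arm
    beta = [1, 1]   # posterior failure counts + 1, per arm
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if rng.random() < p:  # myopic step: exploit the posterior mean
            arm = max(range(2), key=lambda i: alpha[i] / (alpha[i] + beta[i]))
        else:                 # incentivized step: principal's intended arm
            arm = max(range(2), key=lambda i: rng.betavariate(alpha[i], beta[i]))
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += discount * reward
        discount *= gamma
    return total

print(simulate_time_expanded(true_means=[0.4, 0.6]))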
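
The discount-rate bound can be sanity-checked on the simplest possible instance: a single deterministic arm paying reward r every step, for which OPT_δ = r/(1−δ). The snippet below is a toy consistency check of OPT_η ≥ ((1−γ)/(1−η))² • OPT_γ, not the paper's tightness construction, which requires a different instance.

gamma, eta, r = 0.9, 0.8, 1.0  # gamma > eta; one deterministic arm with reward r
opt_gamma = r / (1 - gamma)    # 10.0: pull the arm forever under discount gamma
opt_eta = r / (1 - eta)        # 5.0: the same policy under the smaller discount eta
bound = ((1 - gamma) / (1 - eta)) ** 2 * opt_gamma  # 0.25 * 10.0 = 2.5
assert opt_eta >= bound        # 5.0 >= 2.5: consistent with the bound (not tight here)
print(opt_eta, bound)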