非近视眼agent在Stackelberg游戏中的学习

Proceedings of the 23rd ACM Conference on Economics and Computation Pub Date : 2022-07-12 DOI:10.1145/3490486.3538308

Nika Haghtalab, Thodoris Lykouris, Sloan Nietert, Alexander Wei

{"title":"非近视眼agent在Stackelberg游戏中的学习","authors":"Nika Haghtalab, Thodoris Lykouris, Sloan Nietert, Alexander Wei","doi":"10.1145/3490486.3538308","DOIUrl":null,"url":null,"abstract":"Stackelberg games are a canonical model for strategic principal-agent interactions. Consider, for instance, a defense system that distributes its security resources across high-risk targets prior to attacks being executed; or a tax policymaker who sets rules on when audits are triggered prior to seeing filed tax reports; or a seller who chooses a price prior to knowing a customer's proclivity to buy. In each of these scenarios, a principal first selects an action x∈X and then an agent reacts with an action y∈Y, where X and Y are the principal's and agent's action spaces, respectively. In the examples above, agent actions correspond to which target to attack, how much tax to pay to evade an audit, and how much to purchase, respectively. Typically, the principal wants an x that maximizes their payoff when the agent plays a best response y = br(x); such a pair (x, y) is a Stackelberg equilibrium. By committing to a strategy, the principal can guarantee they achieve a higher payoff than in the fixed point equilibrium of the corresponding simultaneous-play game. However, finding such a strategy requires knowledge of the agent's payoff function. When faced with unknown agent payoffs, the principal can attempt to learn a best response via repeated interactions with the agent. If a (naïve) agent is unaware that such learning occurs and always plays a best response, the principal can use classical online learning approaches to optimize their own payoff in the stage game. Learning from myopic agents has been extensively studied in multiple Stackelberg games, including security games[2,6,7], demand learning[1,5], and strategic classification[3,4]. However, long-lived agents will generally not volunteer information that can be used against them in the future. This is especially the case in online environments where a learner seeks to exploit recently learned patterns of behavior as soon as possible, and the agent can see a tangible advantage for deviating from its instantaneous best response and leading the learner astray. This trade-off between the (statistical) efficiency of learning algorithms and the perverse incentives they may create over the long-term brings us to the main questions of this work: What are principled approaches to learning against non-myopic agents in general Stackelberg games? How can insights from learning against myopic agents be applied to learning in the non-myopic case?","PeriodicalId":209859,"journal":{"name":"Proceedings of the 23rd ACM Conference on Economics and Computation","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Learning in Stackelberg Games with Non-myopic Agents\",\"authors\":\"Nika Haghtalab, Thodoris Lykouris, Sloan Nietert, Alexander Wei\",\"doi\":\"10.1145/3490486.3538308\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stackelberg games are a canonical model for strategic principal-agent interactions. Consider, for instance, a defense system that distributes its security resources across high-risk targets prior to attacks being executed; or a tax policymaker who sets rules on when audits are triggered prior to seeing filed tax reports; or a seller who chooses a price prior to knowing a customer's proclivity to buy. In each of these scenarios, a principal first selects an action x∈X and then an agent reacts with an action y∈Y, where X and Y are the principal's and agent's action spaces, respectively. In the examples above, agent actions correspond to which target to attack, how much tax to pay to evade an audit, and how much to purchase, respectively. Typically, the principal wants an x that maximizes their payoff when the agent plays a best response y = br(x); such a pair (x, y) is a Stackelberg equilibrium. By committing to a strategy, the principal can guarantee they achieve a higher payoff than in the fixed point equilibrium of the corresponding simultaneous-play game. However, finding such a strategy requires knowledge of the agent's payoff function. When faced with unknown agent payoffs, the principal can attempt to learn a best response via repeated interactions with the agent. If a (naïve) agent is unaware that such learning occurs and always plays a best response, the principal can use classical online learning approaches to optimize their own payoff in the stage game. Learning from myopic agents has been extensively studied in multiple Stackelberg games, including security games[2,6,7], demand learning[1,5], and strategic classification[3,4]. However, long-lived agents will generally not volunteer information that can be used against them in the future. This is especially the case in online environments where a learner seeks to exploit recently learned patterns of behavior as soon as possible, and the agent can see a tangible advantage for deviating from its instantaneous best response and leading the learner astray. This trade-off between the (statistical) efficiency of learning algorithms and the perverse incentives they may create over the long-term brings us to the main questions of this work: What are principled approaches to learning against non-myopic agents in general Stackelberg games? How can insights from learning against myopic agents be applied to learning in the non-myopic case?\",\"PeriodicalId\":209859,\"journal\":{\"name\":\"Proceedings of the 23rd ACM Conference on Economics and Computation\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 23rd ACM Conference on Economics and Computation\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3490486.3538308\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 23rd ACM Conference on Economics and Computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3490486.3538308","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

Stackelberg博弈是战略主体-代理互动的典型模型。例如，考虑在攻击执行之前将其安全资源分配给高风险目标的防御系统;或者是一名税务政策制定者，他在看到提交的税务报告之前，就何时启动审计制定了规则;或者一个卖家在知道顾客的购买倾向之前就选择价格。在这些场景中，主体首先选择一个动作x∈x，然后代理以一个动作y∈y做出反应，其中x和y分别是主体和代理的动作空间。在上面的例子中，代理行为分别对应于攻击哪个目标、为逃避审计支付多少税以及购买多少。通常，当代理人采取最佳对策y = br(x)时，委托人希望x能使其收益最大化;这样的一对(x, y)是一个Stackelberg平衡。通过承诺一种策略，委托人可以保证他们获得比在相应的同步博弈的不动点均衡中更高的收益。然而，找到这样的策略需要了解代理的收益函数。当面对未知的代理收益时，委托人可以尝试通过与代理的反复交互来学习最佳响应。如果一个(naïve)代理不知道这种学习的发生，并且总是采取最佳对策，委托人可以使用经典的在线学习方法来优化他们自己在阶段博弈中的收益。在多个Stackelberg博弈(包括安全博弈[2,6,7]、需求学习[1,5]和策略分类[3,4])中，对近视代理的学习进行了广泛的研究。然而，长期工作的代理人通常不会自愿提供将来可能被用来对付他们的信息。这在在线环境中尤其如此，学习者试图尽快利用最近学习的行为模式，而代理可以看到偏离其即时最佳反应并将学习者引入歧途的切实优势。学习算法的(统计)效率与它们可能在长期内产生的反常激励之间的权衡，将我们带入了这项工作的主要问题:在一般的Stackelberg博弈中，针对非近视代理进行学习的原则方法是什么?如何将针对近视主体的学习的见解应用于非近视情况下的学习?

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Learning in Stackelberg Games with Non-myopic Agents

Stackelberg games are a canonical model for strategic principal-agent interactions. Consider, for instance, a defense system that distributes its security resources across high-risk targets prior to attacks being executed; or a tax policymaker who sets rules on when audits are triggered prior to seeing filed tax reports; or a seller who chooses a price prior to knowing a customer's proclivity to buy. In each of these scenarios, a principal first selects an action x∈X and then an agent reacts with an action y∈Y, where X and Y are the principal's and agent's action spaces, respectively. In the examples above, agent actions correspond to which target to attack, how much tax to pay to evade an audit, and how much to purchase, respectively. Typically, the principal wants an x that maximizes their payoff when the agent plays a best response y = br(x); such a pair (x, y) is a Stackelberg equilibrium. By committing to a strategy, the principal can guarantee they achieve a higher payoff than in the fixed point equilibrium of the corresponding simultaneous-play game. However, finding such a strategy requires knowledge of the agent's payoff function. When faced with unknown agent payoffs, the principal can attempt to learn a best response via repeated interactions with the agent. If a (naïve) agent is unaware that such learning occurs and always plays a best response, the principal can use classical online learning approaches to optimize their own payoff in the stage game. Learning from myopic agents has been extensively studied in multiple Stackelberg games, including security games[2,6,7], demand learning[1,5], and strategic classification[3,4]. However, long-lived agents will generally not volunteer information that can be used against them in the future. This is especially the case in online environments where a learner seeks to exploit recently learned patterns of behavior as soon as possible, and the agent can see a tangible advantage for deviating from its instantaneous best response and leading the learner astray. This trade-off between the (statistical) efficiency of learning algorithms and the perverse incentives they may create over the long-term brings us to the main questions of this work: What are principled approaches to learning against non-myopic agents in general Stackelberg games? How can insights from learning against myopic agents be applied to learning in the non-myopic case?

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 23rd ACM Conference on Economics and Computation

自引率

0.00%

发文量