Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning

Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I. Jordan, Haifeng Xu
{"title":"Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning","authors":"Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael I. Jordan, Haifeng Xu","doi":"10.1145/3490486.3538313","DOIUrl":null,"url":null,"abstract":"In today's economy, it becomes important for Internet platforms to consider the sequential information design problem to align its long term interest with incentives of the gig service providers (e.g., drivers, hosts). This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs), in which a sender, with informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximize the sender's cumulative utilities in a finite horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge in finding a signaling policy that is simultaneously persuasive to the myopic receivers and inducing the optimal long-term cumulative utilities of the sender. Nevertheless, in the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. ε-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation that additionally takes persuasiveness into consideration. Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with with the underlying MPP, without the knowledge of the sender's utility functions, prior distributions, and the Markov transition kernels. For such a problem, we design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles. In particular, we obtain optimistic estimates of the value functions to encourage exploration under the unknown environment, and additionally robustify the signaling policy with respect to the uncertainty of prior estimation to prevent receiver's detrimental equilibrium behavior. Our algorithm enjoys sample efficiency by achieving a sublinear √T-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large space of outcomes and states via function approximation, and we showcase such a success under the linear setting.","PeriodicalId":209859,"journal":{"name":"Proceedings of the 23rd ACM Conference on Economics and Computation","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 23rd ACM Conference on Economics and Computation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3490486.3538313","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19

Abstract

In today's economy, it is increasingly important for Internet platforms to consider the sequential information design problem in order to align their long-term interests with the incentives of gig service providers (e.g., drivers, hosts). This paper proposes a novel model of sequential information design, namely the Markov persuasion process (MPP), in which a sender with an informational advantage seeks to persuade a stream of myopic receivers to take actions that maximize the sender's cumulative utility in a finite-horizon Markovian environment with varying priors and utility functions. Planning in MPPs thus faces the unique challenge of finding a signaling policy that is simultaneously persuasive to the myopic receivers and induces the optimal long-term cumulative utility for the sender. Nevertheless, at the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. ε-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation that additionally takes persuasiveness into consideration. Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with the underlying MPP, without knowledge of the sender's utility functions, the prior distributions, or the Markov transition kernels. For this problem, we design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of the optimism and pessimism principles. In particular, we obtain optimistic estimates of the value functions to encourage exploration in the unknown environment, and additionally robustify the signaling policy against the uncertainty of the prior estimates to prevent detrimental equilibrium behavior by the receivers. Our algorithm is sample-efficient, achieving a sublinear √T-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large outcome and state spaces via function approximation, and we showcase such a success in the linear setting.
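To make the planning formulation the abstract alludes to more concrete, the following is a minimal sketch of a persuasiveness-constrained Bellman recursion. The notation (state s, outcome ω, prior μ_h, receiver utility u_h, sender reward r_h, transition kernel P_h, horizon H) is assumed here for illustration and may differ from the paper's exact formulation.

```latex
% Illustrative sketch (assumed notation): at step h, the sender commits to a
% signaling scheme \pi_h(a \mid s, \omega) that directly recommends an action a.
% Persuasiveness (obedience): the myopic receiver at step h has no incentive to
% deviate from a recommended action a to any alternative action a'.
\[
\sum_{\omega} \mu_h(\omega \mid s)\, \pi_h(a \mid s, \omega)
\big[\, u_h(s, \omega, a) - u_h(s, \omega, a') \,\big] \;\ge\; 0
\qquad \text{for all } a, a'.
\]
% Modified Bellman recursion: maximize the sender's immediate reward plus
% continuation value over persuasive schemes only, with V_{H+1} \equiv 0.
\[
V_h(s) \;=\; \max_{\pi_h \ \text{persuasive}} \;
\sum_{\omega, a} \mu_h(\omega \mid s)\, \pi_h(a \mid s, \omega)
\Big[\, r_h(s, \omega, a) + \sum_{s'} P_h(s' \mid s, \omega, a)\, V_{h+1}(s') \,\Big].
\]
```

Under this sketch, each step of the recursion is a linear program over the signaling scheme, since both the obedience constraints and the objective are linear in π_h; this is what makes population-level planning tractable once persuasiveness is folded into the Bellman equation.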