Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

IF 2.2; Zone 3 (Management); JCR Q3 (Management)

Operations Research Pub Date : 2024-04-02 DOI:10.1287/opre.2022.0342

Yuling Yan, Gen Li, Yuxin Chen, Jianqing Fan



Abstract

This paper makes progress toward learning Nash equilibria in two-player, zero-sum Markov games from offline data. Specifically, consider a γ-discounted, infinite-horizon Markov game with S states, in which the max-player has A actions and the min-player has B actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds, called the value iteration with lower confidence bounds for zero-sum Markov games, that provably finds an ε-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star} S (A + B)}{(1 - \gamma)^{3} \varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy ε can be any value within $\left(0, \frac{1}{1 - \gamma}\right]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A, B\}$, achieving minimax optimality for a broad regime of interest. An appealing feature of our result lies in its algorithmic simplicity, which shows that neither variance reduction nor sample splitting is needed to achieve sample optimality.
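To get a feel for how the stated bound scales, the following minimal sketch evaluates only its leading-order term, $C_{\mathsf{clipped}}^{\star} S (A + B) / ((1 - \gamma)^{3} \varepsilon^{2})$. The function name `sample_bound` and all numeric inputs are hypothetical and chosen for illustration; log factors and absolute constants, which the abstract does not specify, are ignored, so the output is a scaling indicator rather than an actual sample count.

```python
def sample_bound(c_clipped: float, S: int, A: int, B: int,
                 gamma: float, eps: float) -> float:
    """Leading-order term C* S (A + B) / ((1 - gamma)^3 eps^2).

    Log factors and absolute constants are dropped (an assumption of this
    sketch), so only the scaling with each parameter is meaningful.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    # The abstract allows any target accuracy in (0, 1 / (1 - gamma)].
    if not 0.0 < eps <= 1.0 / (1.0 - gamma):
        raise ValueError("eps must lie in (0, 1 / (1 - gamma)]")
    return c_clipped * S * (A + B) / ((1.0 - gamma) ** 3 * eps ** 2)


# Hypothetical toy instance: 10 states, 5 max-player and 3 min-player actions.
bound = sample_bound(c_clipped=2.0, S=10, A=5, B=3, gamma=0.5, eps=0.5)
print(bound)  # 5120.0

# Halving eps quadruples the bound, reflecting the 1 / eps^2 dependence.
print(sample_bound(2.0, 10, 5, 3, 0.5, 0.25) / bound)  # 4.0
```

Multiplying the result by `min(A, B)` gives a rough sense of the gap to earlier bounds that the abstract describes, again only up to the factors omitted here.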

Funding: Y. Yan is supported in part by the Charlotte Elizabeth Procter Honorific Fellowship from Princeton University and the Norbert Wiener Postdoctoral Fellowship from MIT. Y. Chen is supported in part by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the Air Force Office of Scientific Research [Grant FA9550-22-1-0198], the Office of Naval Research [Grant N00014-22-1-2354], and the National Science Foundation [Grants CCF-2221009, CCF-1907661, IIS-2218713, DMS-2014279, and IIS-2218773]. J. Fan is supported in part by the National Science Foundation [Grants DMS-1712591, DMS-2052926, DMS-2053832, and DMS-2210833] and the Office of Naval Research [Grant N00014-22-1-2340].

Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.0342.


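Source journal

Operations Research (Management Science: Operations Research and Management Science)

CiteScore: 4.80

Self-citation rate: 14.80%

Articles published: 237

Review time: 15 months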

About the journal: Operations Research publishes quality operations research and management science works of interest to the OR practitioner and researcher in three substantive categories: methods, data-based operational science, and the practice of OR. The journal seeks papers reporting underlying data-based principles of operational science, observations and modeling of operating systems, contributions to the methods and models of OR, case histories of applications, review articles, and discussions of the administrative environment, history, policy, practice, future, and arenas of application of operations research.