Leveraging Privileged Information for Partially Observable Reinforcement Learning

IF 2.8 | CAS Tier 4, Computer Science | Q3 Computer Science, Artificial Intelligence
Jinqiu Li;Enmin Zhao;Tong Wei;Junliang Xing;Shiming Xiang
{"title":"利用特权信息进行部分可观察强化学习","authors":"Jinqiu Li;Enmin Zhao;Tong Wei;Junliang Xing;Shiming Xiang","doi":"10.1109/TG.2025.3542158","DOIUrl":null,"url":null,"abstract":"Reinforcement learning has achieved remarkable success across diverse scenarios. However, learning optimal policies within partially observable games remains a formidable challenge. Crucial privileged information in states is often shrouded during gameplay, yet ideally, it should be accessible and exploitable during training. Previous studies have concentrated on formulating policies based wholly on partial observations or oracle states. Nevertheless, these approaches often face hindrances in attaining effective generalization. To surmount this challenge, we propose the actor–cross-critic (ACC) learning framework, integrating both partial observations and oracle states. ACC achieves this by coordinating two critics and invoking a maximization operation mechanism to switch between them dynamically. This approach encourages the selection of the higher values when computing advantages within the actor–critic framework, thereby accelerating learning and mitigating bias under partial observability. Some theoretical analyses show that ACC exhibits better learning ability toward optimal policies than actor–critic learning using the oracle states. We highlight its superior performance through comprehensive evaluations in decision-making tasks, such as <italic>QuestBall</i>, <italic>Minigrid</i>, and <italic>Atari</i>, and the challenging card game <italic>DouDizhu</i>.","PeriodicalId":55977,"journal":{"name":"IEEE Transactions on Games","volume":"17 3","pages":"765-776"},"PeriodicalIF":2.8000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Leveraging Privileged Information for Partially Observable Reinforcement Learning\",\"authors\":\"Jinqiu Li;Enmin Zhao;Tong Wei;Junliang Xing;Shiming Xiang\",\"doi\":\"10.1109/TG.2025.3542158\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Reinforcement learning has achieved remarkable success across diverse scenarios. However, learning optimal policies within partially observable games remains a formidable challenge. Crucial privileged information in states is often shrouded during gameplay, yet ideally, it should be accessible and exploitable during training. Previous studies have concentrated on formulating policies based wholly on partial observations or oracle states. Nevertheless, these approaches often face hindrances in attaining effective generalization. To surmount this challenge, we propose the actor–cross-critic (ACC) learning framework, integrating both partial observations and oracle states. ACC achieves this by coordinating two critics and invoking a maximization operation mechanism to switch between them dynamically. This approach encourages the selection of the higher values when computing advantages within the actor–critic framework, thereby accelerating learning and mitigating bias under partial observability. Some theoretical analyses show that ACC exhibits better learning ability toward optimal policies than actor–critic learning using the oracle states. 
We highlight its superior performance through comprehensive evaluations in decision-making tasks, such as <italic>QuestBall</i>, <italic>Minigrid</i>, and <italic>Atari</i>, and the challenging card game <italic>DouDizhu</i>.\",\"PeriodicalId\":55977,\"journal\":{\"name\":\"IEEE Transactions on Games\",\"volume\":\"17 3\",\"pages\":\"765-776\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2025-02-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Games\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10887124/\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Games","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10887124/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Reinforcement learning has achieved remarkable success across diverse scenarios. However, learning optimal policies within partially observable games remains a formidable challenge. Crucial privileged information in states is often shrouded during gameplay, yet ideally, it should be accessible and exploitable during training. Previous studies have concentrated on formulating policies based wholly on partial observations or oracle states. Nevertheless, these approaches often face hindrances in attaining effective generalization. To surmount this challenge, we propose the actor–cross-critic (ACC) learning framework, integrating both partial observations and oracle states. ACC achieves this by coordinating two critics and invoking a maximization operation mechanism to switch between them dynamically. This approach encourages the selection of the higher values when computing advantages within the actor–critic framework, thereby accelerating learning and mitigating bias under partial observability. Some theoretical analyses show that ACC exhibits better learning ability toward optimal policies than actor–critic learning using the oracle states. We highlight its superior performance through comprehensive evaluations in decision-making tasks, such as QuestBall, Minigrid, and Atari, and the challenging card game DouDizhu.
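The abstract's description of the maximization mechanism suggests one simple form the combined advantage could take: keep a critic conditioned on partial observations and a critic conditioned on oracle states, and use the element-wise maximum of their value estimates when computing advantages. The sketch below illustrates that reading in PyTorch; the function name, the one-step TD form, and the input shapes are illustrative assumptions, not the paper's actual ACC implementation.

```python
import torch


def cross_critic_advantage(rewards, obs_values, oracle_values, dones, gamma=0.99):
    """One-step advantage estimate that switches between two critics by
    taking the larger value prediction at each step, as suggested by the
    "select the higher values" idea in the abstract (hypothetical sketch).

    rewards, dones:            tensors of shape [T]
    obs_values, oracle_values: value predictions for steps 0..T (shape [T + 1]),
                               from the partial-observation critic and the
                               oracle-state critic respectively.
    """
    # Dynamically combine the two critics by taking the element-wise maximum.
    values = torch.maximum(obs_values, oracle_values)       # [T + 1]
    # Mask bootstrapped values at terminal steps, then form a TD advantage.
    next_values = values[1:] * (1.0 - dones)                # [T]
    advantages = rewards + gamma * next_values - values[:-1]
    return advantages


# Example usage with dummy rollout data of length T = 4.
T = 4
adv = cross_critic_advantage(
    rewards=torch.randn(T),
    obs_values=torch.randn(T + 1),
    oracle_values=torch.randn(T + 1),
    dones=torch.zeros(T),
)
print(adv.shape)  # torch.Size([4])
```

In practice the paper reports that this coordination of the two critics accelerates learning and mitigates bias under partial observability; the sketch above only shows how a max over two value heads can slot into an otherwise standard actor-critic advantage computation.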
Source Journal
IEEE Transactions on Games (Engineering: Electrical and Electronic Engineering)
CiteScore: 4.60
Self-citation rate: 8.70%
Articles published per year: 87