Leveraging Privileged Information for Partially Observable Reinforcement Learning

IF 2.8 | CAS Tier 4, Computer Science | Q3 Computer Science, Artificial Intelligence
Jinqiu Li;Enmin Zhao;Tong Wei;Junliang Xing;Shiming Xiang
{"title":"利用特权信息进行部分可观察强化学习","authors":"Jinqiu Li;Enmin Zhao;Tong Wei;Junliang Xing;Shiming Xiang","doi":"10.1109/TG.2025.3542158","DOIUrl":null,"url":null,"abstract":"Reinforcement learning has achieved remarkable success across diverse scenarios. However, learning optimal policies within partially observable games remains a formidable challenge. Crucial privileged information in states is often shrouded during gameplay, yet ideally, it should be accessible and exploitable during training. Previous studies have concentrated on formulating policies based wholly on partial observations or oracle states. Nevertheless, these approaches often face hindrances in attaining effective generalization. To surmount this challenge, we propose the actor–cross-critic (ACC) learning framework, integrating both partial observations and oracle states. ACC achieves this by coordinating two critics and invoking a maximization operation mechanism to switch between them dynamically. This approach encourages the selection of the higher values when computing advantages within the actor–critic framework, thereby accelerating learning and mitigating bias under partial observability. Some theoretical analyses show that ACC exhibits better learning ability toward optimal policies than actor–critic learning using the oracle states. We highlight its superior performance through comprehensive evaluations in decision-making tasks, such as <italic>QuestBall</i>, <italic>Minigrid</i>, and <italic>Atari</i>, and the challenging card game <italic>DouDizhu</i>.","PeriodicalId":55977,"journal":{"name":"IEEE Transactions on Games","volume":"17 3","pages":"765-776"},"PeriodicalIF":2.8000,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Leveraging Privileged Information for Partially Observable Reinforcement Learning\",\"authors\":\"Jinqiu Li;Enmin Zhao;Tong Wei;Junliang Xing;Shiming Xiang\",\"doi\":\"10.1109/TG.2025.3542158\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Reinforcement learning has achieved remarkable success across diverse scenarios. However, learning optimal policies within partially observable games remains a formidable challenge. Crucial privileged information in states is often shrouded during gameplay, yet ideally, it should be accessible and exploitable during training. Previous studies have concentrated on formulating policies based wholly on partial observations or oracle states. Nevertheless, these approaches often face hindrances in attaining effective generalization. To surmount this challenge, we propose the actor–cross-critic (ACC) learning framework, integrating both partial observations and oracle states. ACC achieves this by coordinating two critics and invoking a maximization operation mechanism to switch between them dynamically. This approach encourages the selection of the higher values when computing advantages within the actor–critic framework, thereby accelerating learning and mitigating bias under partial observability. Some theoretical analyses show that ACC exhibits better learning ability toward optimal policies than actor–critic learning using the oracle states. 
We highlight its superior performance through comprehensive evaluations in decision-making tasks, such as <italic>QuestBall</i>, <italic>Minigrid</i>, and <italic>Atari</i>, and the challenging card game <italic>DouDizhu</i>.\",\"PeriodicalId\":55977,\"journal\":{\"name\":\"IEEE Transactions on Games\",\"volume\":\"17 3\",\"pages\":\"765-776\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2025-02-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Games\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10887124/\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Games","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10887124/","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Reinforcement learning has achieved remarkable success across diverse scenarios. However, learning optimal policies within partially observable games remains a formidable challenge. Crucial privileged information in states is often shrouded during gameplay, yet ideally, it should be accessible and exploitable during training. Previous studies have concentrated on formulating policies based wholly on partial observations or oracle states. Nevertheless, these approaches often face hindrances in attaining effective generalization. To surmount this challenge, we propose the actor–cross-critic (ACC) learning framework, integrating both partial observations and oracle states. ACC achieves this by coordinating two critics and invoking a maximization operation mechanism to switch between them dynamically. This approach encourages the selection of the higher values when computing advantages within the actor–critic framework, thereby accelerating learning and mitigating bias under partial observability. Some theoretical analyses show that ACC exhibits better learning ability toward optimal policies than actor–critic learning using the oracle states. We highlight its superior performance through comprehensive evaluations in decision-making tasks, such as QuestBall, Minigrid, and Atari, and the challenging card game DouDizhu.
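The abstract's description of the maximization mechanism suggests one simple form the combined advantage could take: keep a critic conditioned on partial observations and a critic conditioned on oracle states, and use the element-wise maximum of their value estimates when computing advantages. The sketch below illustrates that reading in PyTorch; the function name, the one-step TD form, and the input shapes are illustrative assumptions, not the paper's actual ACC implementation.

```python
import torch


def cross_critic_advantage(rewards, obs_values, oracle_values, dones, gamma=0.99):
    """One-step advantage estimate that switches between two critics by
    taking the larger value prediction at each step, as suggested by the
    "select the higher values" idea in the abstract (hypothetical sketch).

    rewards, dones:            tensors of shape [T]
    obs_values, oracle_values: value predictions for steps 0..T (shape [T + 1]),
                               from the partial-observation critic and the
                               oracle-state critic respectively.
    """
    # Dynamically combine the two critics by taking the element-wise maximum.
    values = torch.maximum(obs_values, oracle_values)       # [T + 1]
    # Mask bootstrapped values at terminal steps, then form a TD advantage.
    next_values = values[1:] * (1.0 - dones)                # [T]
    advantages = rewards + gamma * next_values - values[:-1]
    return advantages


# Example usage with dummy rollout data of length T = 4.
T = 4
adv = cross_critic_advantage(
    rewards=torch.randn(T),
    obs_values=torch.randn(T + 1),
    oracle_values=torch.randn(T + 1),
    dones=torch.zeros(T),
)
print(adv.shape)  # torch.Size([4])
```

In practice the paper reports that this coordination of the two critics accelerates learning and mitigates bias under partial observability; the sketch above only shows how a max over two value heads can slot into an otherwise standard actor-critic advantage computation.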
Source Journal
IEEE Transactions on Games (Engineering: Electrical and Electronic Engineering)
CiteScore: 4.60
Self-citation rate: 8.70%
Articles published per year: 87