Diversifying Policies With Non-Markov Dispersion to Expand the Solution Space

Bohao Qu; Xiaofeng Cao; Yi Chang; Ivor W. Tsang; Yew-Soon Ong

IEEE Transactions on Pattern Analysis and Machine Intelligence, published 2024-09-06. DOI: 10.1109/TPAMI.2024.3455257. Available at https://ieeexplore.ieee.org/document/10668823/
Citations: 0
Abstract
Policy diversity, encompassing the variety of policies an agent can adopt, enhances reinforcement learning (RL) success by fostering more robust, adaptable, and innovative problem-solving in the environment. The environment in which standard RL operates is usually modeled with a Markov Decision Process (MDP) as the theoretical foundation. However, in many real-world scenarios, the rewards depend on an agent's history of states and actions, leading to a non-MDP. Under the premise of policy diffusion initialization, non-MDPs may have an unstructured, expanding solution space due to varying historical information and temporal dependencies, so solutions in non-MDPs do not share equivalent closed forms. In this paper, we posit that deriving diverse solutions for non-MDPs requires policies to break through the boundaries of the current solution space through gradual dispersion. The goal is to expand the solution space and thereby obtain more diverse policies. Specifically, we first model the sequences of states and actions with a transformer-based method to learn policy embeddings for dispersion in the solution space, since the transformer is well suited to handling sequential data and capturing the long-range dependencies of non-MDPs. We then stack the policy embeddings to construct a dispersion matrix that serves as the policy diversity measure, inducing policy dispersion in the solution space and yielding a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings effectively enlarge the disagreements across policies, yielding a diverse expression of the original policy embedding distribution. Experimental results in both non-MDP and MDP environments show that this dispersion scheme obtains more expressive diverse policies by expanding the solution space, and exhibits more robust performance than recent learning baselines.
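The core construction described above — stacking policy embeddings into a dispersion matrix and checking its positive definiteness as a diversity criterion — can be illustrated with a minimal sketch. This is not the paper's implementation: the embedding dimension, number of policies, and the covariance-style dispersion matrix are illustrative assumptions, and the random vectors stand in for embeddings a transformer encoder would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 12 policies, each summarized by a 4-dimensional
# embedding (in the paper these would come from a transformer over
# state-action sequences; here they are random stand-ins).
num_policies, embed_dim = 12, 4
embeddings = rng.normal(size=(num_policies, embed_dim))

# Stack and center the embeddings, then form a covariance-style
# dispersion matrix D = E_c^T E_c / (n - 1). With more policies than
# embedding dimensions, D is generically positive definite.
centered = embeddings - embeddings.mean(axis=0, keepdims=True)
dispersion = centered.T @ centered / (num_policies - 1)

# Positive definiteness check via the symmetric eigensolver: all
# eigenvalues strictly positive means the embeddings span every
# direction of the space, i.e. the policies genuinely disagree.
eigvals = np.linalg.eigvalsh(dispersion)
is_positive_definite = bool(eigvals.min() > 0)

# log-det of D is one natural scalar summary of how "spread out"
# (diverse) the policy set is: it grows as embeddings disperse.
sign, log_det = np.linalg.slogdet(dispersion)
print("positive definite:", is_positive_definite)
print("log-det diversity score:", log_det)
```

If the embeddings collapsed toward a single point (near-identical policies), the smallest eigenvalue would approach zero and the log-det score would diverge to negative infinity, which is why positive definiteness is the natural threshold for "the embeddings were actually dispersed."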