Active Inference and Reinforcement Learning: A Unified Inference on Continuous State and Action Spaces Under Partial Observability

Parvin Malekzadeh; Konstantinos N. Plataniotis

Neural Computation, Vol. 36, No. 10, pp. 2073-2135 · DOI: 10.1162/neco_a_01698 · Published: 2024-09-17
Impact Factor: 2.7 · JCR Q3 (Computer Science, Artificial Intelligence) · CAS Zone 4 (Computer Science)
Citations: 0

Abstract

Reinforcement learning (RL) has garnered significant attention for developing decision-making agents that aim to maximize rewards, specified by an external supervisor, within fully observable environments. However, many real-world problems involve partial or noisy observations, where agents cannot access complete and accurate information about the environment. These problems are commonly formulated as partially observable Markov decision processes (POMDPs). Previous studies have tackled RL in POMDPs by either incorporating the memory of past actions and observations or by inferring the true state of the environment from observed data. Nevertheless, aggregating observations and actions over time becomes impractical in problems with large decision-making time horizons and high-dimensional spaces. Furthermore, inference-based RL approaches often require many environmental samples to perform well, as they focus solely on reward maximization and neglect uncertainty in the inferred state. Active inference (AIF) is a framework naturally formulated in POMDPs and directs agents to select actions by minimizing a function called expected free energy (EFE). This supplies reward-maximizing (or exploitative) behavior, as in RL, with information-seeking (or exploratory) behavior. Despite this exploratory behavior of AIF, its use is limited to problems with small time horizons and discrete spaces due to the computational challenges associated with EFE. In this article, we propose a unified principle that establishes a theoretical connection between AIF and RL, enabling seamless integration of these two approaches and overcoming their limitations in continuous space POMDP settings. We substantiate our findings with rigorous theoretical analysis, providing novel perspectives for using AIF in designing and implementing artificial agents. Experimental results demonstrate the superior learning capabilities of our method compared to other alternative RL approaches in solving partially observable tasks with continuous spaces. Notably, our approach harnesses information-seeking exploration, enabling it to effectively solve reward-free problems and rendering explicit task reward design by an external supervisor optional.
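For orientation, the balance the abstract describes between exploitative and exploratory behavior is usually expressed in the AIF literature as a decomposition of the expected free energy into an extrinsic (preference- or reward-seeking) term and an epistemic (information-seeking) term. The sketch below uses standard AIF notation (approximate posterior $q$, prior preferences over observations $p(o_\tau)$, hidden states $s_\tau$, policy $\pi$) and is a generic textbook form, not necessarily the exact objective derived in this paper:

$$
G(\pi) \;=\; \sum_{\tau} \Big( \underbrace{-\,\mathbb{E}_{q(o_\tau \mid \pi)}\big[\ln p(o_\tau)\big]}_{\text{extrinsic (reward-maximizing) value}} \;-\; \underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\big[ D_{\mathrm{KL}}\big( q(s_\tau \mid o_\tau, \pi) \,\big\|\, q(s_\tau \mid \pi) \big) \big]}_{\text{epistemic (information-seeking) value}} \Big).
$$

The agent selects the policy that minimizes $G(\pi)$, so maximizing expected preference satisfaction (the reward-like term) and maximizing expected information gain about the hidden state are pursued jointly.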
Source Journal

Neural Computation (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 6.30
Self-citation rate: 3.40%
Publication volume: 83
Average review time: 3.0 months

Journal description: Neural Computation is uniquely positioned at the crossroads between neuroscience and TMCS and welcomes the submission of original papers from all areas of TMCS, including: advanced experimental design; analysis of chemical sensor data; connectomic reconstructions; analysis of multielectrode and optical recordings; genetic data for cell identity; analysis of behavioral data; multiscale models; analysis of molecular mechanisms; neuroinformatics; analysis of brain imaging data; neuromorphic engineering; principles of neural coding, computation, circuit dynamics, and plasticity; and theories of brain function.