Task-Guided IRL in POMDPs that Scales

Artif. Intell. | Pub Date: 2022-12-30 | DOI: 10.48550/arXiv.2301.01219

Franck Djeumou, Christian Ellis, Murat Cubuktepe, Craig T. Lennon, U. Topcu


Citations: 0

Abstract

Task-Guided IRL in POMDPs that Scales

In inverse reinforcement learning (IRL), a learning agent infers a reward function encoding the underlying task using demonstrations from experts. However, many existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs). We address two limitations of existing IRL techniques. First, they require an excessive amount of data due to the information asymmetry between the expert and the learner. Second, most of these IRL techniques require solving the computationally intractable forward problem (computing an optimal policy given a reward function) in POMDPs. The developed algorithm reduces the information asymmetry and increases data efficiency by incorporating task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori, in addition to the demonstrations. Further, the algorithm avoids a common source of algorithmic complexity by using causal entropy, rather than entropy, as the measure of the likelihood of the demonstrations. Nevertheless, the resulting problem is nonconvex due to the so-called forward problem. We address the intrinsic nonconvexity of the forward problem in a scalable manner through a sequential linear programming scheme that is guaranteed to converge to a locally optimal policy. In a series of examples, including experiments in a high-fidelity Unity simulator, we demonstrate that even with a limited amount of data and POMDPs with tens of thousands of states, our algorithm learns reward functions and policies that satisfy the task while inducing behavior similar to the expert's by leveraging the provided side information.
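To make the causal-entropy term in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a finite horizon and a memoryless policy that conditions only on the current observation; the abstract does not state the policy class or horizon the paper uses. The function name `causal_entropy`, its parameters, and the toy two-state POMDP are hypothetical and chosen only for illustration. The sketch propagates the state distribution forward and accumulates the expected value of `-log policy[o][a]` at each step, optionally discounted by `gamma`.

```python
import math


def causal_entropy(P, Z, policy, init, horizon, gamma=1.0):
    """Discounted causal entropy of a memoryless observation-based policy.

    Illustrative assumptions (not taken from the paper):
      P[s][a][s2]  -- probability of moving from state s to s2 under action a
      Z[s][o]      -- probability of observing o in state s
      policy[o][a] -- probability of choosing action a after observing o
      init[s]      -- initial state distribution
    """
    n_states = len(P)
    n_actions = len(P[0])
    dist = list(init)  # state distribution at the current time step
    total = 0.0
    for t in range(horizon):
        nxt = [0.0] * n_states
        for s in range(n_states):
            for o, z in enumerate(Z[s]):
                for a in range(n_actions):
                    # Joint probability of (state, observation, action) at time t.
                    w = dist[s] * z * policy[o][a]
                    if w == 0.0:
                        continue  # also avoids log(0) for zero-probability actions
                    total -= (gamma ** t) * w * math.log(policy[o][a])
                    for s2 in range(n_states):
                        nxt[s2] += w * P[s][a][s2]
        dist = nxt
    return total


# Hypothetical toy POMDP: 2 states, 2 actions, 2 observations.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
Z = [[0.8, 0.2],
     [0.3, 0.7]]
init = [1.0, 0.0]
uniform = [[0.5, 0.5], [0.5, 0.5]]

print(causal_entropy(P, Z, uniform, init, horizon=3))
```

With the uniform policy over two actions, every step contributes log 2 regardless of the dynamics, so the undiscounted three-step value is 3 log 2; a deterministic policy yields zero. A maximum-causal-entropy IRL method would treat this quantity as part of the objective while constraining the policy to match the demonstrations and, in the setting described above, to satisfy the temporal logic specification.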
