通过依赖未来的选项实现时间逻辑任务的通用化

IF 2.9 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Machine Learning Pub Date : 2024-08-26 DOI:10.1007/s10994-024-06614-y

Duo Xu, Faramarz Fekri

{"title":"通过依赖未来的选项实现时间逻辑任务的通用化","authors":"Duo Xu, Faramarz Fekri","doi":"10.1007/s10994-024-06614-y","DOIUrl":null,"url":null,"abstract":"<p>Temporal logic (TL) tasks consist of complex and temporally extended subgoals and they are common for many real-world applications, such as service and navigation robots. However, it is often inefficient or even infeasible to train reinforcement learning (RL) agents to solve multiple TL tasks, since rewards are sparse and non-Markovian in these tasks. A promising solution to this problem is to learn task-conditioned policies which can zero-shot generalize to new TL tasks without further training. However, influenced by some practical issues, such as issues of lossy symbolic observation and long time-horizon of completing TL task, previous works suffer from sample inefficiency in training and sub-optimality (or even infeasibility) in task execution. In order to tackle these issues, this paper proposes an option-based framework to generalize TL tasks, consisting of option training and task execution parts. We have innovations in both parts. In option training, we propose to learn options dependent on the future subgoals via a novel approach. Additionally, we propose to train a multi-step value function which can propagate the rewards of satisfying future subgoals more efficiently in long-horizon tasks. In task execution, in order to ensure the optimality and safety, we propose a model-free MPC planner for option selection, circumventing the learning of a transition model which is required by previous MPC planners. In experiments on three different domains, we evaluate the generalization capability of the agent trained by the proposed method, showing its significant advantage over previous methods.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"317 1","pages":""},"PeriodicalIF":2.9000,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Generalization of temporal logic tasks via future dependent options\",\"authors\":\"Duo Xu, Faramarz Fekri\",\"doi\":\"10.1007/s10994-024-06614-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Temporal logic (TL) tasks consist of complex and temporally extended subgoals and they are common for many real-world applications, such as service and navigation robots. However, it is often inefficient or even infeasible to train reinforcement learning (RL) agents to solve multiple TL tasks, since rewards are sparse and non-Markovian in these tasks. A promising solution to this problem is to learn task-conditioned policies which can zero-shot generalize to new TL tasks without further training. However, influenced by some practical issues, such as issues of lossy symbolic observation and long time-horizon of completing TL task, previous works suffer from sample inefficiency in training and sub-optimality (or even infeasibility) in task execution. In order to tackle these issues, this paper proposes an option-based framework to generalize TL tasks, consisting of option training and task execution parts. We have innovations in both parts. In option training, we propose to learn options dependent on the future subgoals via a novel approach. Additionally, we propose to train a multi-step value function which can propagate the rewards of satisfying future subgoals more efficiently in long-horizon tasks. In task execution, in order to ensure the optimality and safety, we propose a model-free MPC planner for option selection, circumventing the learning of a transition model which is required by previous MPC planners. In experiments on three different domains, we evaluate the generalization capability of the agent trained by the proposed method, showing its significant advantage over previous methods.</p>\",\"PeriodicalId\":49900,\"journal\":{\"name\":\"Machine Learning\",\"volume\":\"317 1\",\"pages\":\"\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2024-08-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine Learning\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10994-024-06614-y\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Learning","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10994-024-06614-y","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

时间逻辑（TL）任务由复杂的时间扩展子目标组成，在服务和导航机器人等许多现实世界的应用中很常见。然而，训练强化学习（RL）代理来解决多个 TL 任务往往效率低下，甚至不可行，因为在这些任务中，奖励是稀疏和非马尔可夫的。解决这一问题的一个很有前景的办法是学习任务条件策略，这种策略无需进一步训练就能对新的 TL 任务进行零点泛化。然而，受一些实际问题的影响，例如有损符号观测和完成 TL 任务的时间跨度较长等问题，以前的工作在训练中存在样本效率低下的问题，在任务执行中存在次优性（甚至不可行性）。为了解决这些问题，本文提出了一个基于选项的框架来泛化 TL 任务，由选项训练和任务执行两部分组成。我们在这两部分都有所创新。在选项训练中，我们建议通过一种新颖的方法来学习与未来子目标相关的选项。此外，我们还建议训练一个多步骤价值函数，它可以在长视距任务中更有效地传播满足未来子目标的奖励。在任务执行过程中，为了确保最优性和安全性，我们提出了一种用于选项选择的无模型 MPC 计划程序，避免了以往 MPC 计划程序所要求的过渡模型学习。在三个不同领域的实验中，我们评估了用所提方法训练的代理的泛化能力，结果表明它比以前的方法有显著优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Generalization of temporal logic tasks via future dependent options

查看原文本刊更多论文

Generalization of temporal logic tasks via future dependent options

Temporal logic (TL) tasks consist of complex and temporally extended subgoals and they are common for many real-world applications, such as service and navigation robots. However, it is often inefficient or even infeasible to train reinforcement learning (RL) agents to solve multiple TL tasks, since rewards are sparse and non-Markovian in these tasks. A promising solution to this problem is to learn task-conditioned policies which can zero-shot generalize to new TL tasks without further training. However, influenced by some practical issues, such as issues of lossy symbolic observation and long time-horizon of completing TL task, previous works suffer from sample inefficiency in training and sub-optimality (or even infeasibility) in task execution. In order to tackle these issues, this paper proposes an option-based framework to generalize TL tasks, consisting of option training and task execution parts. We have innovations in both parts. In option training, we propose to learn options dependent on the future subgoals via a novel approach. Additionally, we propose to train a multi-step value function which can propagate the rewards of satisfying future subgoals more efficiently in long-horizon tasks. In task execution, in order to ensure the optimality and safety, we propose a model-free MPC planner for option selection, circumventing the learning of a transition model which is required by previous MPC planners. In experiments on three different domains, we evaluate the generalization capability of the agent trained by the proposed method, showing its significant advantage over previous methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Machine Learning 工程技术-计算机：人工智能

CiteScore

11.00

自引率

2.70%

发文量

162

审稿时长

3 months

期刊介绍： Machine Learning serves as a global platform dedicated to computational approaches in learning. The journal reports substantial findings on diverse learning methods applied to various problems, offering support through empirical studies, theoretical analysis, or connections to psychological phenomena. It demonstrates the application of learning methods to solve significant problems and aims to enhance the conduct of machine learning research with a focus on verifiable and replicable evidence in published papers.