Trajectory representation learning for Multi-Task NMRDP planning

Firas Jarboui, Vianney Perchet
{"title":"Trajectory representation learning for Multi-Task NMRDP planning","authors":"Firas Jarboui, Vianney Perchet","doi":"10.1109/ICPR48806.2021.9412601","DOIUrl":null,"url":null,"abstract":"Expanding Non Markovian Reward Decision Processes (NMRDP) into Markov Decision Processes (MDP) enables the use of state of the art Reinforcement Learning (RL) techniques to identify optimal policies. In this paper an approach to exploring NMRDPs and expanding them into MDPs, without the prior knowledge of the reward structure, is proposed. The non Markovianity of the reward function is disentangled under the assumption that sets of similar and dissimilar trajectory batches can be sampled. More precisely, within the same batch, measuring the similarity between any couple of trajectories is permitted, although comparing trajectories from different batches is not possible. A modified version of the triplet loss is optimised to construct a representation of the trajectories under which rewards become Markovian.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"46 2","pages":"6786-6793"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 25th International Conference on Pattern Recognition (ICPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPR48806.2021.9412601","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Expanding Non-Markovian Reward Decision Processes (NMRDPs) into Markov Decision Processes (MDPs) enables the use of state-of-the-art Reinforcement Learning (RL) techniques to identify optimal policies. In this paper, an approach to exploring NMRDPs and expanding them into MDPs, without prior knowledge of the reward structure, is proposed. The non-Markovianity of the reward function is disentangled under the assumption that sets of similar and dissimilar trajectory batches can be sampled. More precisely, within the same batch, measuring the similarity between any pair of trajectories is permitted, although comparing trajectories from different batches is not possible. A modified version of the triplet loss is optimised to construct a representation of the trajectories under which rewards become Markovian.
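To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of learning trajectory embeddings with a triplet loss. It assumes trajectories are fixed-length sequences of state vectors, that a "positive" can be drawn from the same (similar) batch as the anchor and a "negative" from a dissimilar batch, and uses a standard triplet margin loss rather than the paper's modified variant. All names (`TrajectoryEncoder`, `state_dim`, etc.) are hypothetical.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Encodes a trajectory (a sequence of states) into a fixed-size vector."""
    def __init__(self, state_dim: int, embed_dim: int):
        super().__init__()
        self.gru = nn.GRU(state_dim, embed_dim, batch_first=True)

    def forward(self, trajectories: torch.Tensor) -> torch.Tensor:
        # trajectories: (batch, seq_len, state_dim)
        _, hidden = self.gru(trajectories)  # hidden: (1, batch, embed_dim)
        return hidden.squeeze(0)            # (batch, embed_dim)

state_dim, embed_dim, seq_len = 8, 16, 20
encoder = TrajectoryEncoder(state_dim, embed_dim)
loss_fn = nn.TripletMarginLoss(margin=1.0)  # standard triplet loss, not the paper's modified version
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Toy data: anchor and positive stand in for two similar trajectories sampled
# from the same batch; negative stands in for a trajectory from a dissimilar batch.
anchor   = torch.randn(32, seq_len, state_dim)
positive = anchor + 0.05 * torch.randn_like(anchor)
negative = torch.randn(32, seq_len, state_dim)

optimizer.zero_grad()
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
optimizer.step()
```

The loss pulls embeddings of similar trajectories together and pushes dissimilar ones apart; in the paper's setting, augmenting the state with such an embedding of the history is what renders the reward Markovian.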