Continuous Deep Maximum Entropy Inverse Reinforcement Learning using online POMDP

Júnior A. R. Silva, V. Grassi, D. Wolf
{"title":"Continuous Deep Maximum Entropy Inverse Reinforcement Learning using online POMDP","authors":"Júnior A. R. Silva, V. Grassi, D. Wolf","doi":"10.1109/ICAR46387.2019.8981548","DOIUrl":null,"url":null,"abstract":"A vehicle navigating in an urban environment must obey traffic rules by properly setting its speed, such as staying below the road speed limit and avoiding collision with other vehicles. This is presumably the scenario that autonomous vehicles will face: they will share the traffic roads with other vehicles (autonomous or not), cooperatively interacting with them. In other words, autonomous vehicles should not only follow traffic rules, but should also behave in such a way that resembles other vehicles behavior. However, manually specification of such behavior is a time-consuming and error-prone task, since driving in urban roads is a complex task, which involves many factors. This paper presents a multitask decision making framework that learns an expert driver's behavior driving in an urban scenario containing traffic lights and other vehicles. For this purpose, Inverse Reinforcement Learning (IRL) is used to learn a reward function that explains the expert driver's behavior. Most IRL approaches require solving a Markov Decision Process (MDP) in each iteration of the algorithm to compute the optimal policy given the current rewards. Nevertheless, the computational cost of solving an MDP is high when considering large state spaces. To overcome this issue, the optimal policy is estimated by sampling trajectories in regions of the space with higher rewards. To do so, the problem is modeled as a continuous Partially Observed Markov Decision Process (POMDP), in which the intentions of other vehicles are only partially observed. An online solver is employed in order to sample trajectories given the current rewards. The efficiency of the proposed framework is demonstrated through simulations, showing that the controlled vehicle is be able to mimic an expert driver's behavior.","PeriodicalId":6606,"journal":{"name":"2019 19th International Conference on Advanced Robotics (ICAR)","volume":"20 1","pages":"382-387"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th International Conference on Advanced Robotics (ICAR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICAR46387.2019.8981548","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

A vehicle navigating in an urban environment must obey traffic rules by properly setting its speed, such as staying below the road speed limit and avoiding collisions with other vehicles. This is presumably the scenario that autonomous vehicles will face: they will share the roads with other vehicles (autonomous or not), cooperatively interacting with them. In other words, autonomous vehicles should not only follow traffic rules, but should also behave in a way that resembles the behavior of other vehicles. However, manually specifying such behavior is a time-consuming and error-prone task, since driving on urban roads is complex and involves many factors. This paper presents a multitask decision-making framework that learns an expert driver's behavior in an urban scenario containing traffic lights and other vehicles. For this purpose, Inverse Reinforcement Learning (IRL) is used to learn a reward function that explains the expert driver's behavior. Most IRL approaches require solving a Markov Decision Process (MDP) in each iteration of the algorithm to compute the optimal policy given the current rewards. However, the computational cost of solving an MDP is high for large state spaces. To overcome this issue, the optimal policy is instead estimated by sampling trajectories in regions of the space with higher rewards. To do so, the problem is modeled as a continuous Partially Observable Markov Decision Process (POMDP), in which the intentions of other vehicles are only partially observed, and an online solver is employed to sample trajectories given the current rewards. The efficiency of the proposed framework is demonstrated through simulations, showing that the controlled vehicle is able to mimic an expert driver's behavior.
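To make the sampling-based idea concrete, the sketch below shows one gradient step of deep maximum-entropy IRL in which the partition function is estimated from sampled trajectories rather than from a full MDP solution. It is a minimal illustration, not the paper's implementation: the reward network architecture, the dummy state features, and the `sample_trajectories` placeholder (which stands in for the online POMDP solver biased toward high-reward regions) are all assumptions introduced here.

```python
# Minimal sketch of sampling-based deep MaxEnt IRL (illustrative only).
# Assumptions: PyTorch reward network over continuous state features, and a
# placeholder sampler standing in for the online POMDP solver in the paper.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a continuous state feature vector to a scalar per-step reward."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states):            # states: (T, state_dim)
        return self.net(states).sum()     # trajectory return under r_theta

def sample_trajectories(reward_net, n, horizon, state_dim):
    # Placeholder: the paper samples trajectories with an online POMDP solver
    # concentrated in high-reward regions; random rollouts are used here only
    # to keep the sketch self-contained.
    return [torch.randn(horizon, state_dim) for _ in range(n)]

def maxent_irl_step(reward_net, optimizer, expert_trajs, n_samples=16):
    """One update: raise the reward of expert trajectories and lower the
    (softmax-weighted) reward of sampled ones, which estimate the partition."""
    horizon, state_dim = expert_trajs[0].shape
    sampled = sample_trajectories(reward_net, n_samples, horizon, state_dim)
    expert_term = torch.stack([reward_net(t) for t in expert_trajs]).mean()
    sample_returns = torch.stack([reward_net(t) for t in sampled])
    # log-partition estimated from the sampled trajectory returns
    log_z = torch.logsumexp(sample_returns, dim=0) - torch.log(
        torch.tensor(float(n_samples)))
    loss = -(expert_term - log_z)         # negative log-likelihood (MaxEnt IRL)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    state_dim, horizon = 8, 30
    reward_net = RewardNet(state_dim)
    optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
    expert = [torch.randn(horizon, state_dim) for _ in range(8)]  # dummy data
    for step in range(5):
        print(f"step {step}: loss = {maxent_irl_step(reward_net, optimizer, expert):.4f}")
```

In this formulation, replacing the exact policy computation with trajectories drawn by an online solver is what keeps each IRL iteration tractable on large continuous state spaces.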