Comparison of Value Iteration, Policy Iteration and Q-Learning for solving Decision-Making problems

2021 International Conference on Unmanned Aircraft Systems (ICUAS). Pub Date: 2021-06-15. DOI: 10.1109/ICUAS51884.2021.9476691

M. Hamadouche, C. Dezan, D. Espès, K. Branco

Citations: 1

Abstract

The 21st century has seen a lot of progress, especially in robotics. Today, advances in electronics and computing capacity make it possible to develop more precise, faster and more autonomous robots. These robots can automatically perform certain delicate or dangerous tasks. To do so, they must move, perceive their environment and make decisions that take into account the goal(s) of a mission under uncertainty. One of the most common probabilistic models for describing missions and for planning under uncertainty is the Markov Decision Process (MDP). There are three fundamental classes of methods for solving MDPs: dynamic programming, Monte Carlo methods, and temporal-difference learning. Each class has its strengths and weaknesses. In this paper, we compare three methods for solving MDPs: Value Iteration and Policy Iteration (Dynamic Programming methods) and Q-Learning (a Temporal-Difference method). We give new criteria for adapting the decision-making method to the application problem, together with explanations of the parameters. Policy Iteration is the most effective method for complex (and irregular) scenarios, and the modified Q-Learning for simple (and regular) scenarios. The regularity of the decision-making problem therefore has to be taken into account when choosing the most appropriate resolution method in terms of execution time. Numerical simulations support these conclusions on a simple and regular grid case, on an irregular grid case, and finally on the mission planning of an Unmanned Aerial Vehicle (UAV), which represents a very irregular case. We demonstrate that the Dynamic Programming (DP) methods are more efficient than the Temporal-Difference (TD) method when facing an irregular set of actions.
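
For readers unfamiliar with the three methods being compared, the following is a minimal sketch of each, applied to a tiny chain MDP. Everything in it is an assumption made for illustration: the 4-state chain, the reward of +1 on reaching the last state, the discount factor of 0.9, and all function names are hypothetical and do not come from the paper. The Q-learning shown is the standard tabular version, since the abstract does not describe the authors' modified variant.

```python
import random

# Hypothetical 4-state chain MDP, invented for illustration (not from the paper).
# States 0..3, state 3 is terminal. Action 0 moves left, action 1 moves right.
N_STATES, N_ACTIONS, GAMMA, TERMINAL = 4, 2, 0.9, 3

def step(s, a):
    """Deterministic model: return (next_state, reward)."""
    if s == TERMINAL:
        return s, 0.0
    s2 = max(s - 1, 0) if a == 0 else s + 1
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def backup(V, s, a):
    """One-step lookahead value r + gamma * V[s']."""
    s2, r = step(s, a)
    return r + GAMMA * V[s2]

def greedy_from_v(V):
    """Greedy policy with respect to a state-value table."""
    return [max(range(N_ACTIONS), key=lambda a: backup(V, s, a))
            for s in range(N_STATES)]

def sweep_until_stable(update, tol):
    """Apply synchronous sweeps V <- update(V) until the change is below tol."""
    V = [0.0] * N_STATES
    while True:
        new_V = [update(V, s) for s in range(N_STATES)]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            return V

def value_iteration(tol=1e-10):
    """Dynamic programming: repeat the Bellman optimality backup."""
    V = sweep_until_stable(
        lambda V, s: max(backup(V, s, a) for a in range(N_ACTIONS)), tol)
    return V, greedy_from_v(V)

def policy_iteration(tol=1e-10):
    """Dynamic programming: alternate policy evaluation and improvement."""
    policy = [0] * N_STATES
    while True:
        # Evaluation: value of following the current policy.
        V = sweep_until_stable(lambda V, s: backup(V, s, policy[s]), tol)
        # Improvement: act greedily with respect to that value.
        new_policy = greedy_from_v(V)
        if new_policy == policy:
            return V, policy
        policy = new_policy

def q_learning(episodes=2000, alpha=0.5, epsilon=0.5, seed=0):
    """Temporal-difference: learn Q from sampled transitions only."""
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            if s == TERMINAL:
                break
            if rng.random() < epsilon:  # explore
                a = rng.randrange(N_ACTIONS)
            else:  # exploit
                a = max(range(N_ACTIONS), key=lambda b: Q[s][b])
            s2, r = step(s, a)
            Q[s][a] += alpha * (r + GAMMA * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

def greedy_from_q(Q):
    """Greedy policy with respect to an action-value table."""
    return [max(range(N_ACTIONS), key=lambda a: Q[s][a])
            for s in range(N_STATES)]
```

Calling `value_iteration()` and `policy_iteration()` returns a value table and a greedy policy, while `q_learning()` returns a Q-table from which `greedy_from_q` extracts a policy. The two dynamic programming routines call the model `step` for every state and action in each sweep, whereas Q-learning only sees the transitions it samples. That structural difference is what a comparison of execution time across regular and irregular action sets would be measuring.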
