{"title":"价值迭代、策略迭代和q -学习在解决决策问题中的比较","authors":"M. Hamadouche, C. Dezan, D. Espès, K. Branco","doi":"10.1109/ICUAS51884.2021.9476691","DOIUrl":null,"url":null,"abstract":"21st century has seen a lot of progress, especially in robotics. Today, the evolution of electronics and computing capacities allows to develop more precise, faster and autonomous robots. They are able to automatically perform certain delicate or dangerous tasks. Robots should move, perceive their environment and make decisions by taking into account the goal(s) of a mission under uncertainty. One of the most current probabilistic model for description of missions and for planning under uncertainty is Markov Decision Process (MDP). In addition, there are three fundamental classes of methods for solving these MDPs: dynamic programming, Monte Carlo methods, and temporal difference learning. Each class of methods has its strengths and weaknesses. In this paper, we present our comparison on three methods for solving MDPs, Value Iteration and Policy Iteration (Dynamic Programming methods) and Q-Learning (Temporal-Difference method). We give new criteria to adapt the decision-making method to the application problem, with the parameters explanations. Policy Iteration is the most effective method for complex (and irregular) scenarios, and the modified Q-Learning for simple (and regular) scenarios. So, the regularity aspect of the decision-making has to be taken into account to choose the most appropriate resolution method in terms of execution time. Numerical simulation shows the conclusion results over simple and regular case of the grid, over the irregular case of the grid example and finally over the mission planning of an Unmanned Aerial Vehicle (UAV), representing is a very irregular case. We demonstrate that the Dynamic Programming (DP) methods are more efficient methods than the Temporal-Difference (TD) method while facing an irregular set of actions.","PeriodicalId":423195,"journal":{"name":"2021 International Conference on Unmanned Aircraft Systems (ICUAS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Comparison of Value Iteration, Policy Iteration and Q-Learning for solving Decision-Making problems\",\"authors\":\"M. Hamadouche, C. Dezan, D. Espès, K. Branco\",\"doi\":\"10.1109/ICUAS51884.2021.9476691\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"21st century has seen a lot of progress, especially in robotics. Today, the evolution of electronics and computing capacities allows to develop more precise, faster and autonomous robots. They are able to automatically perform certain delicate or dangerous tasks. Robots should move, perceive their environment and make decisions by taking into account the goal(s) of a mission under uncertainty. One of the most current probabilistic model for description of missions and for planning under uncertainty is Markov Decision Process (MDP). In addition, there are three fundamental classes of methods for solving these MDPs: dynamic programming, Monte Carlo methods, and temporal difference learning. Each class of methods has its strengths and weaknesses. In this paper, we present our comparison on three methods for solving MDPs, Value Iteration and Policy Iteration (Dynamic Programming methods) and Q-Learning (Temporal-Difference method). 
We give new criteria to adapt the decision-making method to the application problem, with the parameters explanations. Policy Iteration is the most effective method for complex (and irregular) scenarios, and the modified Q-Learning for simple (and regular) scenarios. So, the regularity aspect of the decision-making has to be taken into account to choose the most appropriate resolution method in terms of execution time. Numerical simulation shows the conclusion results over simple and regular case of the grid, over the irregular case of the grid example and finally over the mission planning of an Unmanned Aerial Vehicle (UAV), representing is a very irregular case. We demonstrate that the Dynamic Programming (DP) methods are more efficient methods than the Temporal-Difference (TD) method while facing an irregular set of actions.\",\"PeriodicalId\":423195,\"journal\":{\"name\":\"2021 International Conference on Unmanned Aircraft Systems (ICUAS)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-06-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 International Conference on Unmanned Aircraft Systems (ICUAS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICUAS51884.2021.9476691\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Unmanned Aircraft Systems (ICUAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICUAS51884.2021.9476691","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Comparison of Value Iteration, Policy Iteration and Q-Learning for solving Decision-Making problems
Abstract: The 21st century has seen considerable progress, especially in robotics. Today, advances in electronics and computing capacity make it possible to develop more precise, faster, and more autonomous robots that can automatically perform certain delicate or dangerous tasks. Robots must move, perceive their environment, and make decisions by taking into account the goal(s) of a mission under uncertainty. One of the most widely used probabilistic models for describing missions and planning under uncertainty is the Markov Decision Process (MDP). There are three fundamental classes of methods for solving MDPs: dynamic programming, Monte Carlo methods, and temporal-difference learning, each with its own strengths and weaknesses. In this paper, we compare three methods for solving MDPs: Value Iteration and Policy Iteration (dynamic programming methods) and Q-Learning (a temporal-difference method). We give new criteria for adapting the decision-making method to the application problem, together with explanations of the parameters. Policy Iteration is the most effective method for complex (and irregular) scenarios, and the modified Q-Learning for simple (and regular) scenarios. The regularity of the decision-making problem therefore has to be taken into account when choosing the most appropriate resolution method in terms of execution time. Numerical simulations illustrate these conclusions on a simple and regular grid case, on an irregular grid case, and finally on the mission planning of an Unmanned Aerial Vehicle (UAV), which represents a highly irregular case. We demonstrate that the dynamic programming (DP) methods are more efficient than the temporal-difference (TD) method when facing an irregular set of actions.
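To make the three solver families concrete, the following is a minimal sketch, not the authors' implementation, of Value Iteration, Policy Iteration and tabular Q-Learning on a small hypothetical grid-world MDP. The grid layout, reward values, and parameters (discount factor, learning rate, exploration rate) are illustrative assumptions chosen only to show how the three update rules differ.

```python
# Sketch: Value Iteration, Policy Iteration and tabular Q-Learning on a toy 4x4 grid MDP.
# All rewards and parameters below are illustrative assumptions, not the paper's setup.
import random

N = 4                                        # 4x4 grid; states are (row, col)
STATES = [(r, c) for r in range(N) for c in range(N)]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (N - 1, N - 1)
GAMMA, THETA = 0.95, 1e-6

def step(s, a):
    """Deterministic transition: move if possible, stay otherwise; +1 at the goal, small step cost elsewhere."""
    if s == GOAL:
        return s, 0.0                        # goal is absorbing
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    nxt = (r, c) if 0 <= r < N and 0 <= c < N else s
    return nxt, (1.0 if nxt == GOAL else -0.04)

def value_iteration():
    """Dynamic programming: sweep all states with Bellman optimality backups until convergence."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            best = max(r + GAMMA * V[nxt] for nxt, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < THETA:
            return V

def policy_iteration():
    """Dynamic programming: alternate iterative policy evaluation and greedy policy improvement."""
    policy = {s: "right" for s in STATES}
    V = {s: 0.0 for s in STATES}
    while True:
        while True:                          # policy evaluation
            delta = 0.0
            for s in STATES:
                nxt, r = step(s, policy[s])
                new_v = r + GAMMA * V[nxt]
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < THETA:
                break
        stable = True                        # policy improvement
        for s in STATES:
            best_a = max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:
            return policy, V

def q_learning(episodes=2000, alpha=0.1, epsilon=0.1):
    """Temporal-difference learning: sample episodes and update Q from observed transitions only."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(100):                 # cap episode length
            a = (random.choice(list(ACTIONS)) if random.random() < epsilon
                 else max(ACTIONS, key=lambda a_: Q[(s, a_)]))
            nxt, r = step(s, a)
            target = r + GAMMA * max(Q[(nxt, a_)] for a_ in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = nxt
            if s == GOAL:
                break
    return Q

if __name__ == "__main__":
    V_vi = value_iteration()
    _, V_pi = policy_iteration()
    Q = q_learning()
    print("V(0,0) via value iteration :", round(V_vi[(0, 0)], 3))
    print("V(0,0) via policy iteration:", round(V_pi[(0, 0)], 3))
    print("V(0,0) via Q-learning      :", round(max(Q[((0, 0), a)] for a in ACTIONS), 3))
```

The DP solvers sweep every state using the known transition model, whereas Q-Learning only uses sampled transitions; on this regular toy grid all three converge to similar values, while the paper's point is that their relative execution times diverge once the action set becomes irregular.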