{"title":"q -学习中UCB策略的绩效调查","authors":"Koki Saito, A. Notsu, S. Ubukata, Katsuhiro Honda","doi":"10.1109/ICMLA.2015.59","DOIUrl":null,"url":null,"abstract":"In this paper, we investigated performance and usability of UCBQ algorithm proposed in previous research. This is the algorithm that UCB, which is one of bandit algorithms, is applied to Q-Learning, and can balance between exploitation and exploration. We confirmed in the previous research that it was able to realize effective learning in a partially observable Markov decision process by using a continuous state spaces shortest path problem. We numerically examined it by using a variety of simpler learning situation which is the 2 dimensional goal search problem in a Markov decision process, comparing to previous methods. As a result, we confirmed that it had a better performance than other methods.","PeriodicalId":288427,"journal":{"name":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Performance Investigation of UCB Policy in Q-learning\",\"authors\":\"Koki Saito, A. Notsu, S. Ubukata, Katsuhiro Honda\",\"doi\":\"10.1109/ICMLA.2015.59\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we investigated performance and usability of UCBQ algorithm proposed in previous research. This is the algorithm that UCB, which is one of bandit algorithms, is applied to Q-Learning, and can balance between exploitation and exploration. We confirmed in the previous research that it was able to realize effective learning in a partially observable Markov decision process by using a continuous state spaces shortest path problem. We numerically examined it by using a variety of simpler learning situation which is the 2 dimensional goal search problem in a Markov decision process, comparing to previous methods. As a result, we confirmed that it had a better performance than other methods.\",\"PeriodicalId\":288427,\"journal\":{\"name\":\"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA.2015.59\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2015.59","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Performance Investigation of UCB Policy in Q-learning
In this paper, we investigate the performance and usability of the UCBQ algorithm proposed in our previous research. UCBQ applies UCB, one of the bandit algorithms, to Q-learning so that the agent can balance exploitation and exploration. In that previous research, we confirmed that UCBQ realizes effective learning in a partially observable Markov decision process, using a shortest-path problem with a continuous state space. Here, we numerically examine it in a simpler learning situation, a two-dimensional goal-search problem in a Markov decision process, and compare it with previous methods. As a result, we confirm that UCBQ performs better than the other methods.
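To make the idea concrete, below is a minimal sketch of UCB-style action selection combined with tabular Q-learning, in the spirit of the UCBQ algorithm the abstract describes. The exact update rule, bonus term, and constants used in the paper are not given here, so the class name, parameters, and bonus formula are illustrative assumptions, not the authors' method.

```python
import numpy as np

class UCBQAgent:
    """Tabular Q-learning with a UCB exploration bonus (illustrative sketch)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, c=1.0):
        self.Q = np.zeros((n_states, n_actions))       # action-value estimates
        self.counts = np.zeros((n_states, n_actions))  # visit counts N(s, a)
        self.alpha, self.gamma, self.c = alpha, gamma, c

    def select_action(self, state):
        n = self.counts[state]
        t = n.sum() + 1.0
        # UCB bonus: untried actions get an infinite bonus, so each action
        # is tried at least once; afterwards the bonus shrinks with visits.
        bonus = np.where(n > 0,
                         self.c * np.sqrt(np.log(t) / np.maximum(n, 1)),
                         np.inf)
        return int(np.argmax(self.Q[state] + bonus))

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update toward the bootstrapped target.
        self.counts[state, action] += 1
        target = reward + self.gamma * self.Q[next_state].max()
        self.Q[state, action] += self.alpha * (target - self.Q[state, action])
```

The key design point is that the exploration schedule is driven by state-action visit counts rather than a fixed epsilon, so exploration concentrates on actions whose value estimates are still uncertain, which is the exploitation-exploration balance the abstract attributes to UCBQ.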