{"title":"连续时间隐马尔可夫决策过程策略梯度估计","authors":"Liao Yanjie, Yin Bao-qun, Xi Hongsheng","doi":"10.1109/ICIA.2005.1635101","DOIUrl":null,"url":null,"abstract":"Recently, gradient based methods have received much attention to optimize some dynamic systems with hidden information, such as routing problems of robotic systems. In this paper, we presented a process - continuous time hidden Markov decision process (CTHMDP), which can be used to model the robotic systems. For this process, the problem of policy gradient estimation is studied. Firstly, an approximation formula to the gradient is presented, then by using the uniformization method, we introduce an algorithm, which can be considered as an extension of gradient of partially observable Markov decision process (GPOMDP) algorithm to the continue time model. Finally, the convergence and error bound of the algorithm are considered.","PeriodicalId":136611,"journal":{"name":"2005 IEEE International Conference on Information Acquisition","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"The policy gradient estimation of continuous-time hidden Markov decision processes\",\"authors\":\"Liao Yanjie, Yin Bao-qun, Xi Hongsheng\",\"doi\":\"10.1109/ICIA.2005.1635101\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, gradient based methods have received much attention to optimize some dynamic systems with hidden information, such as routing problems of robotic systems. In this paper, we presented a process - continuous time hidden Markov decision process (CTHMDP), which can be used to model the robotic systems. For this process, the problem of policy gradient estimation is studied. 
Firstly, an approximation formula to the gradient is presented, then by using the uniformization method, we introduce an algorithm, which can be considered as an extension of gradient of partially observable Markov decision process (GPOMDP) algorithm to the continue time model. Finally, the convergence and error bound of the algorithm are considered.\",\"PeriodicalId\":136611,\"journal\":{\"name\":\"2005 IEEE International Conference on Information Acquisition\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2005 IEEE International Conference on Information Acquisition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICIA.2005.1635101\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2005 IEEE International Conference on Information Acquisition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIA.2005.1635101","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The policy gradient estimation of continuous-time hidden Markov decision processes
Recently, gradient-based methods have received much attention for optimizing dynamic systems with hidden information, such as routing problems in robotic systems. In this paper, we present the continuous-time hidden Markov decision process (CTHMDP), which can be used to model such robotic systems, and study the problem of policy gradient estimation for this process. First, an approximation formula for the gradient is presented; then, using the uniformization method, we introduce an algorithm that can be viewed as an extension of the gradient of partially observable Markov decision process (GPOMDP) algorithm to the continuous-time model. Finally, the convergence and error bound of the algorithm are analyzed.
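The uniformization method mentioned in the abstract reduces a continuous-time Markov chain to an equivalent discrete-time chain, which is what allows a discrete-time estimator like GPOMDP to be carried over. A minimal sketch of that reduction (the generator matrix `Q` below is a hypothetical example, not taken from the paper, and this illustrates only the uniformization step, not the authors' full gradient-estimation algorithm):

```python
import numpy as np

# Hypothetical generator matrix Q of a 3-state continuous-time Markov chain:
# off-diagonal entries are transition rates, and each row sums to zero.
Q = np.array([
    [-2.0,  1.0,  1.0],
    [ 0.5, -1.5,  1.0],
    [ 1.0,  1.0, -2.0],
])

# Uniformization: choose a rate lam >= max_i |Q[i, i]| and form the
# discrete-time transition matrix P = I + Q / lam.  The continuous-time
# chain is equivalent to this discrete-time chain observed at the jump
# times of a Poisson process with rate lam.
lam = np.max(np.abs(np.diag(Q)))
P = np.eye(Q.shape[0]) + Q / lam

# P is a proper stochastic matrix: nonnegative entries, rows summing to 1,
# so discrete-time policy-gradient machinery can be applied to it.
assert np.all(P >= 0)
assert np.allclose(P.sum(axis=1), 1.0)
```

Choosing `lam` as the largest exit rate is the standard minimal choice; any larger value also works and only changes how finely the Poisson clock ticks.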