{"title":"基于事件优化的部分可观察马尔可夫决策过程问题的一个特例","authors":"Junyu Zhang","doi":"10.1109/ICIT.2016.7474986","DOIUrl":null,"url":null,"abstract":"In this paper, we discuss a kind of partially observable Markov decision process (POMDP) problem by the event-based optimization which is proposed in [4]. A POMDP ([7] and [8]) is a generalization of a standard completely observable Markov decision process that allows imperfect information about states of the system. Policy iteration algorithms for POMDPs have proved to be impractical as it is very difficult to implement. Thus, most work with POMDPs has used value iteration. But for a special case of POMDP, we can formulate it to an MDP problem. Then we can use our sensitivity view to derive the corresponding average reward difference formula. Based on that and the idea of event-based optimization, we use a single sample path to estimate aggregated potentials. Then we develop policy iteration (PI) algorithms.","PeriodicalId":116715,"journal":{"name":"2016 IEEE International Conference on Industrial Technology (ICIT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A special case of partially observable Markov decision processes problem by event-based optimization\",\"authors\":\"Junyu Zhang\",\"doi\":\"10.1109/ICIT.2016.7474986\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we discuss a kind of partially observable Markov decision process (POMDP) problem by the event-based optimization which is proposed in [4]. A POMDP ([7] and [8]) is a generalization of a standard completely observable Markov decision process that allows imperfect information about states of the system. Policy iteration algorithms for POMDPs have proved to be impractical as it is very difficult to implement. Thus, most work with POMDPs has used value iteration. But for a special case of POMDP, we can formulate it to an MDP problem. Then we can use our sensitivity view to derive the corresponding average reward difference formula. Based on that and the idea of event-based optimization, we use a single sample path to estimate aggregated potentials. Then we develop policy iteration (PI) algorithms.\",\"PeriodicalId\":116715,\"journal\":{\"name\":\"2016 IEEE International Conference on Industrial Technology (ICIT)\",\"volume\":\"43 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-03-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE International Conference on Industrial Technology (ICIT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICIT.2016.7474986\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Conference on Industrial Technology (ICIT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICIT.2016.7474986","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A special case of partially observable Markov decision processes problem by event-based optimization
In this paper, we discuss a kind of partially observable Markov decision process (POMDP) problem using the event-based optimization framework proposed in [4]. A POMDP ([7] and [8]) is a generalization of a standard, completely observable Markov decision process that allows imperfect information about the states of the system. Policy iteration algorithms for general POMDPs have proved impractical because they are very difficult to implement, so most work on POMDPs has used value iteration instead. For a special case of POMDP, however, the problem can be reformulated as an MDP. We then use the sensitivity-based view to derive the corresponding average reward difference formula. Based on this formula and the idea of event-based optimization, we estimate aggregated potentials from a single sample path and develop policy iteration (PI) algorithms.
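To make the single-sample-path idea concrete, the sketch below illustrates how aggregated potentials might be estimated from one simulated trajectory and then used in an event-based improvement step. This is only a minimal illustration, not the paper's algorithm: it assumes the special-case POMDP has already been reformulated as a finite MDP, identifies an "event" with the observation attached to a state, and uses hypothetical names (simulate_path, aggregated_potentials, improve_policy) and parameters (truncation window W) that are not from the paper.

import numpy as np

# A minimal sketch, NOT the paper's exact algorithm.  It assumes the special
# case of the POMDP has already been reformulated as a finite MDP with states
# 0..S-1, actions 0..A-1, transition tensor P[a, i, j], reward r[a, i], and an
# observation map obs[i]; an "event" is identified here with the observation
# of the current state, and states sharing an observation share one
# aggregated potential.  All function names are hypothetical.

def simulate_path(P, r, policy, s0=0, T=200_000, seed=None):
    """Run one sample path under a stationary (state-feedback) policy."""
    rng = np.random.default_rng(seed)
    S = P.shape[1]
    states = np.empty(T, dtype=int)
    rewards = np.empty(T)
    s = s0
    for t in range(T):
        a = policy[s]
        states[t] = s
        rewards[t] = r[a, s]
        s = rng.choice(S, p=P[a, s])
    return states, rewards

def aggregated_potentials(states, rewards, obs, W=100):
    """Estimate aggregated potentials g(o) from a single path.

    The potential of state i is approximated by the truncated centered sum
    g(i) ~ E[ sum_{t=0}^{W-1} (r(X_t) - eta) | X_0 = i ]; visits of states
    sharing the same observation are pooled, which is the aggregation idea
    used in event-based optimization."""
    eta = rewards.mean()                      # long-run average reward estimate
    centered = rewards - eta
    csum = np.concatenate(([0.0], np.cumsum(centered)))
    sums, counts = {}, {}
    for t in range(len(states) - W):
        o = obs[states[t]]
        sums[o] = sums.get(o, 0.0) + (csum[t + W] - csum[t])
        counts[o] = counts.get(o, 0) + 1
    return eta, {o: sums[o] / counts[o] for o in sums}

def improve_policy(P, r, obs, g, mu):
    """One event-based improvement step: for each event (observation) pick the
    action maximizing sum_i mu(i|o) [ r(i,a) + sum_j P(j|i,a) g(obs(j)) ],
    the improvement direction suggested by an average reward difference
    formula.  mu[i] is the empirical state-visit frequency from the path."""
    A, S, _ = P.shape
    g_vec = np.array([g.get(obs[j], 0.0) for j in range(S)])
    event_action = {}
    for o in set(obs):
        members = [i for i in range(S) if obs[i] == o]
        w = np.array([mu[i] for i in members], dtype=float)
        w = w / w.sum() if w.sum() > 0 else np.full(len(members), 1.0 / len(members))
        q = [sum(wi * (r[a, i] + P[a, i] @ g_vec) for wi, i in zip(w, members))
             for a in range(A)]
        event_action[o] = int(np.argmax(q))
    # translate the event-level decision back to a state-feedback policy
    return np.array([event_action[obs[i]] for i in range(S)], dtype=int)

A full PI loop would alternate these three steps (simulate a path under the current policy, estimate aggregated potentials, improve the event-to-action map) until the policy stops changing; the precise difference formula and convergence arguments are developed in the paper itself.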