Entropy Maximization for Constrained Markov Decision Processes

Yagiz Savas, Melkior Ornik, Murat Cubuktepe, U. Topcu

2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), October 2018. DOI: 10.1109/ALLERTON.2018.8636066
{"title":"约束马尔可夫决策过程的熵最大化","authors":"Yagiz Savas, Melkior Ornik, Murat Cubuktepe, U. Topcu","doi":"10.1109/ALLERTON.2018.8636066","DOIUrl":null,"url":null,"abstract":"We study the problem of synthesizing a policy that maximizes the entropy of a Markov decision process (MDP) subject to expected reward constraints. Such a policy minimizes the predictability of the paths it generates in an MDP while attaining certain reward thresholds. We first show that the maximum entropy of an MDP can be finite, infinite or unbounded. We provide necessary and sufficient conditions under which the maximum entropy of an MDP is finite, infinite or unbounded. We then present an algorithm to synthesize a policy that maximizes the entropy of an MDP. The proposed algorithm is based on a convex optimization problem and runs in time polynomial in the size of the MDP. Finally, we extend the algorithm to an MDP subject to expected total reward constraints. In numerical examples, we demonstrate the proposed method on different motion planning scenarios and illustrate the trade-off between the predictability of paths and the level of the collected reward.","PeriodicalId":299280,"journal":{"name":"2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"Entropy Maximization for Constrained Markov Decision Processes\",\"authors\":\"Yagiz Savas, Melkior Ornik, Murat Cubuktepe, U. Topcu\",\"doi\":\"10.1109/ALLERTON.2018.8636066\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We study the problem of synthesizing a policy that maximizes the entropy of a Markov decision process (MDP) subject to expected reward constraints. Such a policy minimizes the predictability of the paths it generates in an MDP while attaining certain reward thresholds. We first show that the maximum entropy of an MDP can be finite, infinite or unbounded. We provide necessary and sufficient conditions under which the maximum entropy of an MDP is finite, infinite or unbounded. We then present an algorithm to synthesize a policy that maximizes the entropy of an MDP. The proposed algorithm is based on a convex optimization problem and runs in time polynomial in the size of the MDP. Finally, we extend the algorithm to an MDP subject to expected total reward constraints. 
In numerical examples, we demonstrate the proposed method on different motion planning scenarios and illustrate the trade-off between the predictability of paths and the level of the collected reward.\",\"PeriodicalId\":299280,\"journal\":{\"name\":\"2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton)\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ALLERTON.2018.8636066\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ALLERTON.2018.8636066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Entropy Maximization for Constrained Markov Decision Processes
Abstract: We study the problem of synthesizing a policy that maximizes the entropy of a Markov decision process (MDP) subject to expected reward constraints. Such a policy minimizes the predictability of the paths it generates in an MDP while attaining certain reward thresholds. We first show that the maximum entropy of an MDP can be finite, infinite, or unbounded, and we provide necessary and sufficient conditions distinguishing these three cases. We then present an algorithm to synthesize a policy that maximizes the entropy of an MDP. The proposed algorithm is based on a convex optimization problem and runs in time polynomial in the size of the MDP. Finally, we extend the algorithm to an MDP subject to expected total reward constraints. In numerical examples, we demonstrate the proposed method on different motion planning scenarios and illustrate the trade-off between the predictability of paths and the level of the collected reward.
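As a rough illustration of the kind of convex program the abstract describes, the following Python sketch (using cvxpy) maximizes an entropy objective over expected state-action counts, an occupancy-measure formulation, subject to an expected total reward constraint. The toy MDP, the variable names, and the choice to measure the entropy of the state-action process (rather than the state paths the paper considers) are assumptions made for illustration, not the paper's exact formulation.

import numpy as np
import cvxpy as cp

# Toy absorbing MDP (hypothetical numbers): 3 states, state 2 absorbing, 2 actions.
# P[s, a, s'] is the probability of moving to s' when taking action a in state s.
P = np.zeros((3, 2, 3))
P[0, 0] = [0.1, 0.8, 0.1]
P[0, 1] = [0.0, 0.2, 0.8]
P[1, 0] = [0.5, 0.4, 0.1]
P[1, 1] = [0.0, 0.1, 0.9]
P[2, :, 2] = 1.0                        # absorbing state

live, nA = [0, 1], 2                    # non-absorbing states, number of actions
mu0 = np.array([1.0, 0.0])              # initial distribution over live states
R = np.array([[1.0, 0.2],               # reward r(s, a) on live states
              [0.8, 0.1]])
Gamma = 0.9                             # required expected total reward

# Entropy of each transition kernel; a constant of the program.
HP = -np.sum(np.where(P > 0, P * np.log(P), 0.0), axis=2)

# lam[s, a]: expected number of times action a is taken in state s.
lam = cp.Variable((len(live), nA), nonneg=True)

# Entropy of the state-action process, decomposed per visit:
#   H = sum_s lam(s) H(pi(.|s)) + sum_{s,a} lam(s,a) H(P(.|s,a)),
# where -rel_entr(lam(s,a), lam(s)) = -lam(s,a) log(lam(s,a)/lam(s)) is concave,
# so the whole program is a convex (concave-maximization) problem.
obj = 0
for i, s in enumerate(live):
    lam_s = cp.sum(lam[i, :])
    for a in range(nA):
        obj += -cp.rel_entr(lam[i, a], lam_s) + lam[i, a] * HP[s, a]

# Flow conservation on live states plus the expected-reward constraint.
cons = [cp.sum(cp.multiply(R, lam)) >= Gamma]
for i, s in enumerate(live):
    inflow = sum(P[sp, a, s] * lam[j, a]
                 for j, sp in enumerate(live) for a in range(nA))
    cons.append(cp.sum(lam[i, :]) == mu0[i] + inflow)

cp.Problem(cp.Maximize(obj), cons).solve()
policy = lam.value / lam.value.sum(axis=1, keepdims=True)   # pi(a|s)
print("entropy:", obj.value)
print("policy:\n", policy)

The decomposition in the objective, visit-weighted policy entropy plus visit-weighted transition entropy, is what keeps the program concave via the rel_entr atom, and the entropy-maximizing stationary policy is recovered from the optimal occupancies as pi(a|s) = lam(s, a) / sum over a' of lam(s, a').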