{"title":"时间平均MDP中的最优策略是什么?","authors":"Nicolas Gast, Bruno Gaujal, Kimang Khun","doi":"10.1145/3626570.3626582","DOIUrl":null,"url":null,"abstract":"This paper discusses the notion of optimality for time-average MDPs. We argue that while most authors claim to use the \"average reward\" criteria, the notion that is implicitly used is in fact the notion of what we call Bellman optimality. We show that it does not coincide with other existing notions of optimality, like gain-optimality and bias-optimality but has strong connection with canonical-policies (policies that are optimal for any finite horizons) as well as value iteration and policy iterations algorithms.","PeriodicalId":35745,"journal":{"name":"Performance Evaluation Review","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"What is an Optimal Policy in Time-Average MDP?\",\"authors\":\"Nicolas Gast, Bruno Gaujal, Kimang Khun\",\"doi\":\"10.1145/3626570.3626582\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper discusses the notion of optimality for time-average MDPs. We argue that while most authors claim to use the \\\"average reward\\\" criteria, the notion that is implicitly used is in fact the notion of what we call Bellman optimality. We show that it does not coincide with other existing notions of optimality, like gain-optimality and bias-optimality but has strong connection with canonical-policies (policies that are optimal for any finite horizons) as well as value iteration and policy iterations algorithms.\",\"PeriodicalId\":35745,\"journal\":{\"name\":\"Performance Evaluation Review\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Performance Evaluation Review\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3626570.3626582\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Performance Evaluation Review","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3626570.3626582","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Computer Science","Score":null,"Total":0}
This paper discusses the notion of optimality for time-average MDPs. We argue that while most authors claim to use the "average reward" criterion, the notion that is implicitly used is in fact what we call Bellman optimality. We show that it does not coincide with other existing notions of optimality, such as gain-optimality and bias-optimality, but that it has strong connections with canonical policies (policies that are optimal for all finite horizons) as well as with the value iteration and policy iteration algorithms.
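To make the Bellman-optimality notion mentioned in the abstract concrete, the sketch below runs standard relative value iteration on a tiny average-reward MDP: it looks for a gain g and a relative value (bias) vector h satisfying g + h(s) = max_a [ r(s,a) + sum_{s'} P(s'|s,a) h(s') ], and the policy that is greedy with respect to h is a Bellman-optimal policy in the sense discussed here. This is a generic illustration of the criterion, not the paper's algorithm; the 2-state, 2-action MDP, the reference state, and all numerical values are made up for the example.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP (values are arbitrary, not from the paper).
# P[a, s, s'] = probability of moving from s to s' under action a
P = np.array([
    [[0.9, 0.1],   # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],   # action 1
     [0.6, 0.4]],
])
# r[s, a] = immediate reward in state s for action a
r = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

n_states, n_actions = r.shape
h = np.zeros(n_states)   # relative value (bias) estimate
ref = 0                  # reference state used to normalize h
g = 0.0

for _ in range(1000):
    # Bellman operator: Q[s, a] = r(s, a) + sum_{s'} P(s'|s, a) h(s')
    Q = r + np.einsum('ast,t->sa', P, h)
    Th = Q.max(axis=1)
    g = Th[ref]          # current gain estimate
    h_new = Th - g       # subtract the reference value to keep h bounded
    if np.max(np.abs(h_new - h)) < 1e-10:
        h = h_new
        break
    h = h_new

# A greedy policy with respect to h solves the average-reward Bellman equation.
policy = Q.argmax(axis=1)
print("estimated gain:", g)
print("greedy (Bellman-optimal) policy:", policy)
```

On this toy instance the loop converges in a few hundred iterations and prints one action per state; the same skeleton extends to larger state spaces as long as the MDP is well behaved enough (e.g., unichain) for relative value iteration to converge.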