{"title":"未折现不定视界mdp的离线极大极小q函数学习","authors":"Fengying Li, Yuqiang Li, Xianyi Wu, Wei Bai","doi":"10.1007/s10463-025-00924-1","DOIUrl":null,"url":null,"abstract":"<div><p>This work considers the offline evaluation problem for indefinite-horizon Markov Decision Processes. A minimax Q-function learning algorithm is proposed, which, instead of i.i.d. tuples <span>\\((s,a,s',r)\\)</span>, evaluates undiscounted expected return based by i.i.d. trajectories truncated at a given time step. The confidence error bounds are developed. Experiments using Open AI’s Cart Pole environment are employed to demonstrate the algorithm.</p></div>","PeriodicalId":55511,"journal":{"name":"Annals of the Institute of Statistical Mathematics","volume":"77 4","pages":"535 - 562"},"PeriodicalIF":0.6000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Offline minimax Q-function learning for undiscounted indefinite-horizon MDPs\",\"authors\":\"Fengying Li, Yuqiang Li, Xianyi Wu, Wei Bai\",\"doi\":\"10.1007/s10463-025-00924-1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>This work considers the offline evaluation problem for indefinite-horizon Markov Decision Processes. A minimax Q-function learning algorithm is proposed, which, instead of i.i.d. tuples <span>\\\\((s,a,s',r)\\\\)</span>, evaluates undiscounted expected return based by i.i.d. trajectories truncated at a given time step. The confidence error bounds are developed. 
Experiments using Open AI’s Cart Pole environment are employed to demonstrate the algorithm.</p></div>\",\"PeriodicalId\":55511,\"journal\":{\"name\":\"Annals of the Institute of Statistical Mathematics\",\"volume\":\"77 4\",\"pages\":\"535 - 562\"},\"PeriodicalIF\":0.6000,\"publicationDate\":\"2025-04-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Annals of the Institute of Statistical Mathematics\",\"FirstCategoryId\":\"100\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10463-025-00924-1\",\"RegionNum\":4,\"RegionCategory\":\"数学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"STATISTICS & PROBABILITY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of the Institute of Statistical Mathematics","FirstCategoryId":"100","ListUrlMain":"https://link.springer.com/article/10.1007/s10463-025-00924-1","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
Offline minimax Q-function learning for undiscounted indefinite-horizon MDPs
This work considers the offline evaluation problem for indefinite-horizon Markov Decision Processes. A minimax Q-function learning algorithm is proposed which, instead of i.i.d. tuples \((s,a,s',r)\), evaluates the undiscounted expected return based on i.i.d. trajectories truncated at a given time step. Confidence error bounds are developed. Experiments using OpenAI's CartPole environment demonstrate the algorithm.
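The paper's minimax Q-function learning algorithm itself is not reproduced here. Purely as an illustration of the data setup the abstract describes — i.i.d. trajectories truncated at a given time step, each yielding an undiscounted return — the following minimal sketch shows a Monte Carlo estimate of that return on a toy MDP. The function names (`rollout`, `mc_value_estimate`) and the toy environment are assumptions for illustration, not from the paper.

```python
def rollout(policy, step_fn, s0, T):
    """Run one trajectory from s0, truncated at time step T, and
    return its undiscounted return: the plain sum of rewards."""
    s, G = s0, 0.0
    for _ in range(T):
        a = policy(s)
        s, r, done = step_fn(s, a)
        G += r  # no discount factor: rewards are summed as-is
        if done:
            break
    return G

def mc_value_estimate(policy, step_fn, s0, T, n):
    """Average the undiscounted truncated returns of n i.i.d. trajectories."""
    return sum(rollout(policy, step_fn, s0, T) for _ in range(n)) / n

# Toy deterministic chain MDP (hypothetical): reward 1 per step,
# the episode terminates once the state counter reaches 3.
def chain_step(s, a):
    s_next = s + 1
    return s_next, 1.0, s_next >= 3

print(mc_value_estimate(lambda s: 0, chain_step, 0, T=5, n=10))  # → 3.0
```

Truncation at T bounds each trajectory's length, which is what makes the undiscounted return well defined in the indefinite-horizon setting the paper studies.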
Journal Introduction:
Annals of the Institute of Statistical Mathematics (AISM) aims to provide a forum for open communication among statisticians, and to contribute to the advancement of statistics as a science to enable humans to handle information in order to cope with uncertainties. It publishes high-quality papers that shed new light on the theoretical, computational and/or methodological aspects of statistical science. Emphasis is placed on (a) development of new methodologies motivated by real data, (b) development of unifying theories, and (c) analysis and improvement of existing methodologies and theories.