A criterion for selecting the appropriate one from the trained models for model-based offline policy evaluation

IF 7.3 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

CAAI Transactions on Intelligence Technology Pub Date : 2024-10-09 DOI:10.1049/cit2.12376

Chongchong Li, Yue Wang, Zhi-Ming Ma, Yuting Liu

{"title":"A criterion for selecting the appropriate one from the trained models for model-based offline policy evaluation","authors":"Chongchong Li, Yue Wang, Zhi-Ming Ma, Yuting Liu","doi":"10.1049/cit2.12376","DOIUrl":null,"url":null,"abstract":"<p>Offline policy evaluation, evaluating and selecting complex policies for decision-making by only using offline datasets is important in reinforcement learning. At present, the model-based offline policy evaluation (MBOPE) is widely welcomed because of its easy to implement and good performance. MBOPE directly approximates the unknown value of a given policy using the Monte Carlo method given the estimated transition and reward functions of the environment. Usually, multiple models are trained, and then one of them is selected to be used. However, a challenge remains in selecting an appropriate model from those trained for further use. The authors first analyse the upper bound of the difference between the approximated value and the unknown true value. Theoretical results show that this difference is related to the trajectories generated by the given policy on the learnt model and the prediction error of the transition and reward functions at these generated data points. Based on the theoretical results, a new criterion is proposed to tell which trained model is better suited for evaluating the given policy. At last, the effectiveness of the proposed criterion is demonstrated on both benchmark and synthetic offline datasets.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"10 1","pages":"223-234"},"PeriodicalIF":7.3000,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12376","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CAAI Transactions on Intelligence Technology","FirstCategoryId":"94","ListUrlMain":"https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cit2.12376","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Offline policy evaluation, evaluating and selecting complex policies for decision-making by only using offline datasets is important in reinforcement learning. At present, the model-based offline policy evaluation (MBOPE) is widely welcomed because of its easy to implement and good performance. MBOPE directly approximates the unknown value of a given policy using the Monte Carlo method given the estimated transition and reward functions of the environment. Usually, multiple models are trained, and then one of them is selected to be used. However, a challenge remains in selecting an appropriate model from those trained for further use. The authors first analyse the upper bound of the difference between the approximated value and the unknown true value. Theoretical results show that this difference is related to the trajectories generated by the given policy on the learnt model and the prediction error of the transition and reward functions at these generated data points. Based on the theoretical results, a new criterion is proposed to tell which trained model is better suited for evaluating the given policy. At last, the effectiveness of the proposed criterion is demonstrated on both benchmark and synthetic offline datasets.

Abstract Image

查看原文本刊更多论文

为基于模型的离线策略评估从训练模型中选择合适模型的标准

离线策略评估，仅使用离线数据集评估和选择复杂的决策策略在强化学习中很重要。目前，基于模型的离线策略评估（mope）因其易于实现和性能好而受到广泛欢迎。mope在给定环境的估计过渡函数和奖励函数的情况下，使用蒙特卡罗方法直接逼近给定策略的未知值。通常，训练多个模型，然后选择其中一个模型进行使用。然而，从那些训练过的模型中选择合适的模型以供进一步使用仍然是一个挑战。作者首先分析了逼近值与未知真值之差的上界。理论结果表明，这种差异与给定策略在学习模型上产生的轨迹以及在这些生成的数据点上的过渡函数和奖励函数的预测误差有关。在理论结果的基础上，提出了一个新的准则来判断哪个训练模型更适合评估给定的策略。最后，在基准数据集和综合离线数据集上验证了该准则的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

CAAI Transactions on Intelligence Technology COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-

CiteScore

11.00

自引率

3.90%

发文量

134

审稿时长

35 weeks

期刊介绍： CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI) providing research which is openly accessible to read and share worldwide.