Agent self-assessment: Determining policy quality without execution

2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). Pub Date: 2011-04-11. DOI: 10.1109/ADPRL.2011.5967358

A. Hans, S. Düll, S. Udluft


Citations: 10

Abstract

With the development of data-efficient reinforcement learning (RL) methods, a promising data-driven solution for optimal control of complex technical systems has become available. For the application of RL to a technical system, it is usually required to evaluate a policy before actually applying it to ensure it operates the system safely and within required performance bounds. In benchmark applications one can use the system dynamics directly to measure the policy quality. In real applications, however, this might be too expensive or even impossible. Being unable to evaluate the policy without using the actual system hinders the application of RL to autonomous controllers. As a first step toward agent self-assessment, we deal with discrete MDPs in this paper. We propose to use the value function along with its uncertainty to assess a policy's quality and show that, when dealing with an MDP estimated from observations, the value function itself can be misleading. We address this problem by determining the value function's uncertainty through uncertainty propagation and evaluate the approach using a number of benchmark applications.
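The idea of reporting a value function together with its uncertainty can be illustrated with a small sketch. This is not the authors' method: the paper determines the uncertainty through uncertainty propagation, whereas the sketch below swaps in plain Monte Carlo sampling from a Dirichlet posterior over the estimated transition probabilities. All function names, the uniform prior, and the assumption that rewards are known exactly are hypothetical choices made for illustration only.

```python
import random
import statistics


def evaluate_policy(p_pi, r_pi, gamma=0.9, sweeps=200):
    """Iterative policy evaluation for a fixed policy on a tabular MDP.

    p_pi[s][s2] is the transition probability under the policy's action,
    r_pi[s] is the expected immediate reward in state s.
    """
    n = len(p_pi)
    values = [0.0] * n
    for _ in range(sweeps):
        values = [
            r_pi[s] + gamma * sum(p_pi[s][s2] * values[s2] for s2 in range(n))
            for s in range(n)
        ]
    return values


def sample_transition_row(counts, rng, prior=1.0):
    """Draw one probability vector from Dirichlet(counts + prior)."""
    draws = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def value_with_uncertainty(counts, rewards, policy, gamma=0.9,
                           n_samples=200, seed=0):
    """Return per-state mean and standard deviation of the policy's value.

    counts[s][a][s2] are observed transition counts. rewards[s][a] are
    treated as known exactly, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n = len(counts)
    r_pi = [rewards[s][policy[s]] for s in range(n)]
    samples = []
    for _ in range(n_samples):
        # One plausible MDP consistent with the observations.
        p_pi = [sample_transition_row(counts[s][policy[s]], rng)
                for s in range(n)]
        samples.append(evaluate_policy(p_pi, r_pi, gamma))
    mean = [statistics.fmean([v[s] for v in samples]) for s in range(n)]
    std = [statistics.stdev([v[s] for v in samples]) for s in range(n)]
    return mean, std


# Hypothetical two-state, two-action example: the same empirical
# frequencies observed from few versus many transitions.
counts_few = [[[1, 1], [2, 0]], [[0, 2], [1, 1]]]
counts_many = [[[100 * c for c in row] for row in state]
               for state in counts_few]
rewards = [[0.0, 1.0], [0.0, 0.0]]
policy = [1, 0]

for label, counts in (("few", counts_few), ("many", counts_many)):
    mean, std = value_with_uncertainty(counts, rewards, policy)
    print(label, [round(m, 2) for m in mean], [round(s, 2) for s in std])
```

In this toy setup the standard deviation shrinks as more transitions are observed, which is the kind of signal a value estimate alone does not provide when the MDP is estimated from limited data.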

