Plan-based reward shaping for reinforcement learning

2008 4th International IEEE Conference Intelligent Systems Pub Date : 2008-11-11 DOI:10.1109/IS.2008.4670492

M. Grzes, D. Kudenko


Citations: 79

Abstract

Reinforcement learning, while being a highly popular learning technique for agents and multi-agent systems, has so far encountered difficulties when applied to more complex domains due to scaling-up problems. This paper focuses on the use of domain knowledge to improve the convergence speed and optimality of various RL techniques. Specifically, we propose the use of high-level STRIPS operator knowledge in reward shaping to focus the search for the optimal policy. Empirical results show that the plan-based reward shaping approach outperforms other RL techniques, including alternative manual and MDP-based reward shaping when it is used in its basic form. We show that MDP-based reward shaping may fail, and successful experiments with STRIPS-based shaping suggest modifications which can overcome the encountered problems. The STRIPS-based method we propose allows expressing the same domain knowledge in a different way, and the domain expert can choose whether to define an MDP or a STRIPS planning task. We also evaluate the robustness of the proposed STRIPS-based technique to errors in the plan knowledge.
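The abstract does not state the shaping formula, so the following is only an illustrative sketch, not the authors' implementation. It assumes the standard potential-based form F(s, s') = gamma * Phi(s') - Phi(s), with a potential Phi that grows with the index of the plan step a low-level state maps to. The names `make_plan_potential`, `shaping_reward`, `abstract_state`, the `scale` parameter, the toy plan, and the choice to give off-plan states a potential of 0 are all assumptions made for this example.

```python
from typing import Callable, Hashable, Sequence

State = Hashable


def make_plan_potential(
    plan: Sequence[str],
    abstract_state: Callable[[State], str],
    scale: float = 1.0,
) -> Callable[[State], float]:
    """Build a potential function from a high-level plan (hypothetical helper).

    The potential of a low-level state is scale * (index of the plan step
    its abstraction matches). States that match no plan step get 0, which
    is an assumption of this sketch.
    """
    step_index = {step: i for i, step in enumerate(plan)}

    def potential(state: State) -> float:
        return scale * step_index.get(abstract_state(state), 0)

    return potential


def shaping_reward(
    potential: Callable[[State], float],
    state: State,
    next_state: State,
    gamma: float,
) -> float:
    """Potential-based shaping term: gamma * Phi(s') - Phi(s)."""
    return gamma * potential(next_state) - potential(state)


# Toy example: a state is (position, has_key); the plan is "get the key".
def abstract_state(state: State) -> str:
    _position, has_key = state
    return "has_key" if has_key else "no_key"


phi = make_plan_potential(["no_key", "has_key"], abstract_state, scale=10.0)

# Progressing along the plan earns a positive shaping bonus.
bonus_progress = shaping_reward(phi, (0, False), (1, True), gamma=0.9)

# Staying at the same plan step yields a small negative term when gamma < 1.
bonus_stay = shaping_reward(phi, (1, True), (2, True), gamma=0.9)
```

In a learner such as Q-learning, the shaping term would be added to the environment reward on each transition. Because the term is a difference of potentials, this form is known to leave the optimal policy unchanged, which is why it is a common starting point for encoding plan knowledge.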

