基于占用信息比的稀疏奖励环境下信息导向策略搜索

2023 57th Annual Conference on Information Sciences and Systems (CISS) Pub Date : 2023-03-22 DOI:10.1109/CISS56502.2023.10089655

Wesley A. Suttle, Alec Koppel, Ji Liu

{"title":"基于占用信息比的稀疏奖励环境下信息导向策略搜索","authors":"Wesley A. Suttle, Alec Koppel, Ji Liu","doi":"10.1109/CISS56502.2023.10089655","DOIUrl":null,"url":null,"abstract":"This paper examines a new measure of the exploration/exploitation trade-off in reinforcement learning (RL) called the occupancy information ratio (OIR). To this end, the paper derives the Information-Directed Actor-Critic (IDAC) algorithm for solving the OIR problem, provides an overview of the rich theory underlying IDAC and related OIR policy gradient methods, and experimentally investigates the advantages of such methods. The central contribution of this paper is to provide empirical evidence that, due to the form of the OIR objective, IDAC enjoys superior performance over vanilla RL methods in sparse-reward environments.","PeriodicalId":243775,"journal":{"name":"2023 57th Annual Conference on Information Sciences and Systems (CISS)","volume":"213 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Information-Directed Policy Search in Sparse-Reward Settings via the Occupancy Information Ratio\",\"authors\":\"Wesley A. Suttle, Alec Koppel, Ji Liu\",\"doi\":\"10.1109/CISS56502.2023.10089655\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper examines a new measure of the exploration/exploitation trade-off in reinforcement learning (RL) called the occupancy information ratio (OIR). To this end, the paper derives the Information-Directed Actor-Critic (IDAC) algorithm for solving the OIR problem, provides an overview of the rich theory underlying IDAC and related OIR policy gradient methods, and experimentally investigates the advantages of such methods. The central contribution of this paper is to provide empirical evidence that, due to the form of the OIR objective, IDAC enjoys superior performance over vanilla RL methods in sparse-reward environments.\",\"PeriodicalId\":243775,\"journal\":{\"name\":\"2023 57th Annual Conference on Information Sciences and Systems (CISS)\",\"volume\":\"213 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-03-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 57th Annual Conference on Information Sciences and Systems (CISS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CISS56502.2023.10089655\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 57th Annual Conference on Information Sciences and Systems (CISS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISS56502.2023.10089655","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

本文研究了强化学习(RL)中探索/利用权衡的一种新度量，称为占用信息比(OIR)。为此，本文推导了用于解决OIR问题的信息导向行为者-批评家(Information-Directed Actor-Critic, IDAC)算法，概述了IDAC的丰富理论基础和相关的OIR策略梯度方法，并实验研究了这些方法的优点。本文的核心贡献是提供了经验证据，证明由于OIR目标的形式，IDAC在稀疏奖励环境中比普通RL方法具有更好的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Information-Directed Policy Search in Sparse-Reward Settings via the Occupancy Information Ratio

This paper examines a new measure of the exploration/exploitation trade-off in reinforcement learning (RL) called the occupancy information ratio (OIR). To this end, the paper derives the Information-Directed Actor-Critic (IDAC) algorithm for solving the OIR problem, provides an overview of the rich theory underlying IDAC and related OIR policy gradient methods, and experimentally investigates the advantages of such methods. The central contribution of this paper is to provide empirical evidence that, due to the form of the OIR objective, IDAC enjoys superior performance over vanilla RL methods in sparse-reward environments.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2023 57th Annual Conference on Information Sciences and Systems (CISS)

自引率

0.00%

发文量