Extended Abstract: Learning in Low-rank MDPs with Density Features

Audrey Huang, Jinglin Chen, Nan Jiang
{"title":"Extended Abstract: Learning in Low-rank MDPs with Density Features","authors":"Audrey Huang, Jinglin Chen, Nan Jiang","doi":"10.1109/CISS56502.2023.10089731","DOIUrl":null,"url":null,"abstract":"In online reinforcement learning (RL) with large state spaces, MDPs with low-rank transitions-that is, the transition matrix can be factored into the product of two matrices, left and right-is a highly representative structure that enables tractable exploration. When given to the learner, the left matrix enables expressive function approximation for value-based learning, and this setting has been studied extensively (e.g., in linear MDPs). Similarly, the right matrix induces powerful models for state-occupancy densities. However, using such density features to learn in low-rank MDPs has never been studied to the best of our knowledge, and is a setting with interesting connections to leveraging the power of generative models in RL. In this work, we initiate the study of learning low-rank MDPs with density features. Our algorithm performs reward-free learning and builds an exploratory distribution in a level-by-level manner. It uses the density features for off-policy estimation of the policies' state distributions, and constructs the exploratory data by choosing the barycentric spanner of these distributions. From an analytical point of view, the additive error of distribution estimation is largely incompatible with the multiplicative definition of data coverage (e.g., concentrability). In the absence of strong assumptions like reachability, this incompatibility may lead to exponential or even infinite errors under standard analysis strategies, which we overcome via novel technical tools.","PeriodicalId":243775,"journal":{"name":"2023 57th Annual Conference on Information Sciences and Systems (CISS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 57th Annual Conference on Information Sciences and Systems (CISS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CISS56502.2023.10089731","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

In online reinforcement learning (RL) with large state spaces, MDPs with low-rank transitions (that is, the transition matrix can be factored into the product of two matrices, a left factor and a right factor) are a highly representative structure that enables tractable exploration. When given to the learner, the left matrix enables expressive function approximation for value-based learning, and this setting has been studied extensively (e.g., in linear MDPs). Similarly, the right matrix induces powerful models for state-occupancy densities. However, to the best of our knowledge, using such density features to learn in low-rank MDPs has never been studied, and it is a setting with interesting connections to leveraging the power of generative models in RL. In this work, we initiate the study of learning low-rank MDPs with density features. Our algorithm performs reward-free learning and builds an exploratory distribution in a level-by-level manner. It uses the density features for off-policy estimation of the policies' state distributions, and constructs the exploratory data by choosing the barycentric spanner of these distributions. From an analytical point of view, the additive error of distribution estimation is largely incompatible with the multiplicative definition of data coverage (e.g., concentrability). In the absence of strong assumptions like reachability, this incompatibility may lead to exponential or even infinite errors under standard analysis strategies, which we overcome via novel technical tools.
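
As background for the abstract's use of density features, the following is a minimal sketch of the standard low-rank factorization and the occupancy recursion it induces. The notation (horizon index h, left features phi_h, right density features mu_h, rank d, and occupancy d_h^pi) is ours for illustration and is not spelled out in the abstract.

```latex
% Low-rank factorization of the transition kernel at level h:
P_h(s' \mid s, a) \;=\; \big\langle \phi_h(s, a),\, \mu_h(s') \big\rangle,
\qquad \phi_h(s, a),\ \mu_h(s') \in \mathbb{R}^d .

% State-occupancy recursion for any policy \pi:
d_{h+1}^{\pi}(s') \;=\; \sum_{s, a} d_h^{\pi}(s)\, \pi(a \mid s)\, P_h(s' \mid s, a)
\;=\; \Big\langle \sum_{s, a} d_h^{\pi}(s)\, \pi(a \mid s)\, \phi_h(s, a),\ \mu_h(s') \Big\rangle .
```

In words, every policy's occupancy at level h+1 is a linear combination of the d density features mu_h, which is what makes the off-policy, level-by-level estimation of state distributions described in the abstract possible in principle.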
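The exploratory-data step mentioned in the abstract selects a barycentric spanner of the estimated state distributions. Below is a hedged sketch of a generic C-approximate barycentric spanner routine (the classical Awerbuch-Kleinberg construction), not code from the paper; the names `barycentric_spanner`, `X` (candidate distributions represented as d-dimensional coefficient vectors, one per row), and `C` are illustrative assumptions.

```python
import numpy as np

def barycentric_spanner(X, C=2.0, tol=1e-12):
    """C-approximate barycentric spanner of the rows of X (shape n x d).

    Assumes the rows span R^d. Returns indices of d rows such that every
    row of X is a linear combination of the selected rows with coefficients
    bounded by C in absolute value. (Sketch of the Awerbuch-Kleinberg
    construction; not the paper's own code.)
    """
    n, d = X.shape
    basis = np.eye(d)      # columns hold the current spanner elements
    idx = [-1] * d         # indices into X of the chosen rows

    def abs_det(M):
        return abs(np.linalg.det(M))

    # Phase 1: for each coordinate, swap in the candidate that maximizes
    # the absolute determinant of the current basis.
    for i in range(d):
        dets = []
        for j in range(n):
            M = basis.copy()
            M[:, i] = X[j]
            dets.append(abs_det(M))
        j_star = int(np.argmax(dets))
        basis[:, i] = X[j_star]
        idx[i] = j_star

    # Phase 2: keep swapping in any candidate that grows the determinant
    # by more than a factor of C; terminates because each swap multiplies
    # the determinant by > C > 1 and the candidate set is finite.
    improved = True
    while improved:
        improved = False
        cur = abs_det(basis)
        for i in range(d):
            for j in range(n):
                M = basis.copy()
                M[:, i] = X[j]
                if abs_det(M) > C * cur + tol:
                    basis[:, i] = X[j]
                    idx[i] = j
                    cur = abs_det(M)
                    improved = True
    return idx
```

In the setting the abstract describes, the rows of X would plausibly be the estimated state distributions of candidate policies expressed in the density-feature coordinates, and the selected rows would determine which policies' data are collected to form the exploratory distribution at the next level.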