Ensemble successor representations for task generalization in offline-to-online reinforcement learning

Impact Factor 7.3 · CAS Tier 2 (Computer Science) · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Changhong Wang, Xudong Yu, Chenjia Bai, Qiaosheng Zhang, Zhen Wang
{"title":"Ensemble successor representations for task generalization in offline-to-online reinforcement learning","authors":"Changhong Wang, Xudong Yu, Chenjia Bai, Qiaosheng Zhang, Zhen Wang","doi":"10.1007/s11432-023-4028-1","DOIUrl":null,"url":null,"abstract":"<p>In reinforcement learning (RL), training a policy from scratch with online experiences can be inefficient because of the difficulties in exploration. Recently, offline RL provides a promising solution by giving an initialized offline policy, which can be refined through online interactions. However, existing approaches primarily perform offline and online learning in the same task, without considering the task generalization problem in offline-to-online adaptation. In real-world applications, it is common that we only have an offline dataset from a specific task while aiming for fast online-adaptation for several tasks. To address this problem, our work builds upon the investigation of successor representations for task generalization in online RL and extends the framework to incorporate offline-to-online learning. We demonstrate that the conventional paradigm using successor features cannot effectively utilize offline data and improve the performance for the new task by online fine-tuning. To mitigate this, we introduce a novel methodology that leverages offline data to acquire an ensemble of successor representations and subsequently constructs ensemble <i>Q</i> functions. This approach enables robust representation learning from datasets with different coverage and facilitates fast adaption of <i>Q</i> functions towards new tasks during the online fine-tuning phase. Extensive empirical evaluations provide compelling evidence showcasing the superior performance of our method in generalizing to diverse or even unseen tasks.</p>","PeriodicalId":21618,"journal":{"name":"Science China Information Sciences","volume":"75 1","pages":""},"PeriodicalIF":7.3000,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Science China Information Sciences","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11432-023-4028-1","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

In reinforcement learning (RL), training a policy from scratch with online experience can be inefficient because of the difficulty of exploration. Offline RL offers a promising alternative by providing an initial policy learned from offline data, which can then be refined through online interaction. However, existing approaches primarily perform offline and online learning on the same task, without considering the task generalization problem in offline-to-online adaptation. In real-world applications, it is common to have an offline dataset from only one specific task while aiming for fast online adaptation to several tasks. To address this problem, our work builds on the study of successor representations for task generalization in online RL and extends the framework to the offline-to-online setting. We show that the conventional paradigm based on successor features cannot effectively exploit offline data or improve performance on the new task through online fine-tuning. To mitigate this, we introduce a novel methodology that leverages offline data to learn an ensemble of successor representations and subsequently constructs ensemble Q functions. This approach enables robust representation learning from datasets with different coverage and facilitates fast adaptation of the Q functions to new tasks during the online fine-tuning phase. Extensive empirical evaluations provide compelling evidence of the superior performance of our method in generalizing to diverse or even unseen tasks.
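For readers unfamiliar with the underlying mechanism, the successor-feature framework that the abstract refers to decomposes a policy's value into a task-independent representation and a task-specific reward weight: if rewards are approximately linear in features, r(s, a) ≈ φ(s, a)ᵀw, then Q(s, a) = ψ(s, a)ᵀw, where ψ is the expected discounted sum of future features. The sketch below is an illustration of this general idea under those standard assumptions, not the authors' implementation; the names `fit_task_weights`, `ensemble_q`, and the toy data are hypothetical. It shows how an ensemble of successor representations could be combined into ensemble Q functions, and how adapting to a new task can reduce to re-estimating only the weight vector w from online rewards.

```python
import numpy as np

# Illustrative sketch of successor features (SF) with an ensemble.
# Assumption: rewards are approximately linear in a feature map phi,
#   r(s, a) ~= phi(s, a) @ w,
# so each Q function factorizes as Q(s, a) = psi(s, a) @ w, where
# psi(s, a) = E[ sum_t gamma^t phi(s_t, a_t) ] is the successor feature.

def fit_task_weights(phi_batch, reward_batch, reg=1e-3):
    """Estimate the task weight vector w by ridge regression on observed rewards.
    phi_batch: (N, d) feature matrix, reward_batch: (N,) rewards."""
    d = phi_batch.shape[1]
    A = phi_batch.T @ phi_batch + reg * np.eye(d)
    b = phi_batch.T @ reward_batch
    return np.linalg.solve(A, b)

def ensemble_q(psi_ensemble, w):
    """Combine an ensemble of successor features into a conservative Q estimate.
    psi_ensemble: (K, d) successor features for one (s, a) from K ensemble members.
    Taking the minimum over members is one common pessimistic choice; the mean
    or a lower confidence bound are alternatives."""
    q_members = psi_ensemble @ w          # (K,) per-member Q values
    return q_members.min()

# Toy usage: a new task only changes w; the successor features are reused.
rng = np.random.default_rng(0)
phi_online = rng.normal(size=(256, 8))             # features from online transitions
r_online = phi_online @ rng.normal(size=8) + 0.01 * rng.normal(size=256)
w_new = fit_task_weights(phi_online, r_online)     # fast adaptation: only w is re-fit
psi_sa = rng.normal(size=(5, 8))                   # K = 5 ensemble members for one (s, a)
print("ensemble Q estimate:", ensemble_q(psi_sa, w_new))
```

In the offline-to-online setting described in the abstract, the successor representations would be learned from the offline dataset, while a lightweight weight re-estimation of this kind is the sort of step that can be repeated quickly during online fine-tuning for a new task.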

Source journal: Science China Information Sciences (Computer Science, Information Systems)
CiteScore: 12.60
Self-citation rate: 5.70%
Articles per year: 224
Average review time: 8.3 months
Journal description: Science China Information Sciences is a dedicated journal that showcases high-quality, original research across various domains of information sciences. It encompasses Computer Science & Technologies, Control Science & Engineering, Information & Communication Engineering, Microelectronics & Solid-State Electronics, and Quantum Information, providing a platform for the dissemination of significant contributions in these fields.