Human-informed skill discovery: Controlled diversity with preference in reinforcement learning

IF 7.5 · Zone 1, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems with Applications · Pub Date: 2025-06-15 · DOI: 10.1016/j.eswa.2025.128604

Maxence Hussonnois, Thommen George Karimpanal, Mayank Shekhar Jha, Santu Rana

Expert Systems with Applications, volume 292, Article 128604. Journal article; not open access. Full text: https://www.sciencedirect.com/science/article/pii/S0957417425022237

Citations: 0

Abstract


Human-informed skill discovery: Controlled diversity with preference in reinforcement learning

Autonomously learning diverse behaviours without an extrinsic reward signal has been a problem of interest in reinforcement learning. However, the nature of learning in such mechanisms is unconstrained, often resulting in the accumulation of several unusable, unsafe or misaligned skills. In order to avoid such issues and to ensure the discovery of safe and human-aligned skills, it is necessary to incorporate humans into the unsupervised training process, which remains a largely unexplored topic. In this work, we propose Controlled Diversity with Preference (CDP), a novel, collaborative human-guided mechanism for an agent to learn a set of skills that is diverse as well as desirable. The key principle is to restrict the discovery of skills to regions that are deemed to be desirable as per a preference model trained using human preference labels on trajectory pairs. We evaluate our approach on 2D navigation and MuJoCo environments and demonstrate the ability to discover diverse, yet desirable skills. We also provide principled guidelines for selecting suitable hyperparameter values along with comprehensive sensitivity analyses of the various factors influencing the performance of our approach.
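The key principle in the abstract, a preference model trained on human labels over trajectory pairs that restricts skill discovery to desirable regions, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a linear per-state score, a Bradley-Terry (logistic) preference likelihood of the kind commonly used in preference-based reinforcement learning, synthetic labels from a hypothetical annotator who prefers trajectories lying further to the right in a 2D plane, and a hard threshold gate on a placeholder diversity reward. All function names and parameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def feature_diff(traj_a, traj_b):
    """Summed 2D states of traj_a minus summed 2D states of traj_b."""
    return [sum(s[k] for s in traj_a) - sum(s[k] for s in traj_b) for k in range(2)]


def train_preference_model(pairs, labels, lr=0.05, epochs=200):
    """Fit a linear score w . s with a Bradley-Terry likelihood.

    pairs  : list of (traj_a, traj_b), each trajectory a list of (x, y) states
    labels : 1.0 if the (hypothetical) annotator preferred traj_a, else 0.0
    """
    feats = [feature_diff(a, b) for a, b in pairs]
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for phi, y in zip(feats, labels):
            # P(a preferred over b) = sigmoid(return(a) - return(b))
            p = sigmoid(w[0] * phi[0] + w[1] * phi[1])
            for k in range(2):
                grad[k] += (p - y) * phi[k]  # gradient of the cross-entropy loss
        w = [wk - lr * gk / len(feats) for wk, gk in zip(w, grad)]
    return w


def make_synthetic_data(n_pairs=200, horizon=5, seed=0):
    """Assumed annotator: prefers the trajectory with the larger summed x."""
    rng = random.Random(seed)

    def rand_traj():
        return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]

    pairs, labels = [], []
    for _ in range(n_pairs):
        a, b = rand_traj(), rand_traj()
        pairs.append((a, b))
        labels.append(1.0 if sum(s[0] for s in a) > sum(s[0] for s in b) else 0.0)
    return pairs, labels


def gated_reward(w, state, diversity_reward, threshold=0.0):
    """Pass the diversity reward through only where the learned score
    marks the state as desirable (an illustrative gating choice)."""
    score = w[0] * state[0] + w[1] * state[1]
    return diversity_reward if score >= threshold else 0.0


pairs, labels = make_synthetic_data()
w = train_preference_model(pairs, labels)
print("learned weights:", w)
print("reward at (0.8, 0.0):", gated_reward(w, (0.8, 0.0), 1.0))
print("reward at (-0.8, 0.0):", gated_reward(w, (-0.8, 0.0), 1.0))
```

In this toy setting the learned weight on x comes out positive, so a skill-discovery objective multiplied by such a gate would only be rewarded for diversity in the right half of the plane. A soft gate, for example scaling by the sigmoid of the score, would be an equally plausible design choice.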


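Source journal

Expert Systems with Applications (Engineering & Technology / Engineering: Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 10.60%
Articles published: 2045
Review time: 8.7 months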

About the journal: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.