{"title":"基于稀疏奖励致密化的失败强化学习鲁棒人机团队","authors":"Mingkang Wu;Yongcan Cao","doi":"10.1109/LCSYS.2025.3591199","DOIUrl":null,"url":null,"abstract":"Learning control policies in sparse reward environments is a challenging task for many robotic control tasks. The existing studies focus on designing reinforcement learning algorithms that take human inputs in the form of demonstrations such that control policies are learned via uncovering the value of these demonstrations. One typical approach is to learn an inherent reward function that can explain why demonstrations are better than other randomly generated samples. Albeit powerful, the use of human demonstrations is typically costly and difficult to collect, indicating the lack of robustness in these studies. To enhance robustness, we here propose to use failed experiences, namely, failure, due to the easiness of obtaining failure dataset, requiring only common sense rather than domain knowledge needed to generate expert demonstrations. In particular, this letter proposes a new reward densification technique that trains a discriminator to evaluate the similarity between the agent’s current behavior and failure dataset provided by humans. This reward densification technique provides an effective mechanism to obtain state-action values for environments with sparse rewards, via quantifying their (dis)similarity with failure. Additionally, the value of the current behavior, formulated as advantage function, is employed based on the densified reward to refine the control policy’s search direction. We finally conduct several experiments to demonstrate the effectiveness of the proposed approach by comparing with state-of-art methods.","PeriodicalId":37235,"journal":{"name":"IEEE Control Systems Letters","volume":"9 ","pages":"2315-2320"},"PeriodicalIF":2.0000,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Robust Human-Machine Teaming Through Reinforcement Learning From Failure via Sparse Reward Densification\",\"authors\":\"Mingkang Wu;Yongcan Cao\",\"doi\":\"10.1109/LCSYS.2025.3591199\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Learning control policies in sparse reward environments is a challenging task for many robotic control tasks. The existing studies focus on designing reinforcement learning algorithms that take human inputs in the form of demonstrations such that control policies are learned via uncovering the value of these demonstrations. One typical approach is to learn an inherent reward function that can explain why demonstrations are better than other randomly generated samples. Albeit powerful, the use of human demonstrations is typically costly and difficult to collect, indicating the lack of robustness in these studies. To enhance robustness, we here propose to use failed experiences, namely, failure, due to the easiness of obtaining failure dataset, requiring only common sense rather than domain knowledge needed to generate expert demonstrations. In particular, this letter proposes a new reward densification technique that trains a discriminator to evaluate the similarity between the agent’s current behavior and failure dataset provided by humans. This reward densification technique provides an effective mechanism to obtain state-action values for environments with sparse rewards, via quantifying their (dis)similarity with failure. 
Additionally, the value of the current behavior, formulated as advantage function, is employed based on the densified reward to refine the control policy’s search direction. We finally conduct several experiments to demonstrate the effectiveness of the proposed approach by comparing with state-of-art methods.\",\"PeriodicalId\":37235,\"journal\":{\"name\":\"IEEE Control Systems Letters\",\"volume\":\"9 \",\"pages\":\"2315-2320\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-07-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Control Systems Letters\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11087555/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Control Systems Letters","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11087555/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Learning control policies in sparse-reward environments is challenging for many robotic control tasks. Existing studies focus on designing reinforcement learning algorithms that take human input in the form of demonstrations, so that control policies are learned by uncovering the value of these demonstrations. One typical approach is to learn an inherent reward function that explains why demonstrations are better than randomly generated samples. Albeit powerful, human demonstrations are typically costly and difficult to collect, which limits the robustness of these methods. To enhance robustness, we propose to use failed experiences, namely failures, because a failure dataset is easy to obtain: it requires only common sense rather than the domain knowledge needed to generate expert demonstrations. In particular, this letter proposes a new reward densification technique that trains a discriminator to evaluate the similarity between the agent's current behavior and a failure dataset provided by humans. This technique provides an effective mechanism for obtaining state-action values in environments with sparse rewards by quantifying the (dis)similarity of state-action pairs with failure. Additionally, the value of the current behavior, formulated as an advantage function, is computed from the densified reward to refine the control policy's search direction. Finally, we conduct several experiments that demonstrate the effectiveness of the proposed approach by comparing it with state-of-the-art methods.
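The abstract describes the mechanism only at a high level. The following Python sketch illustrates one plausible way a discriminator-based densification from failure data could be wired up; it is a minimal illustration under assumed interfaces (the names FailureDiscriminator, densified_reward, and advantage are hypothetical), not the letter's actual implementation, and the paper should be consulted for the exact reward and advantage formulations.

```python
# Hypothetical sketch: densifying a sparse reward with a discriminator trained
# to recognize human-provided failure behavior. Names and formulas are illustrative.
import torch
import torch.nn as nn

class FailureDiscriminator(nn.Module):
    """Scores how similar a (state, action) pair is to the failure dataset."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: high = looks like failure
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def discriminator_loss(disc, failure_batch, agent_batch):
    """Binary cross-entropy: failure data labeled 1, current agent behavior labeled 0.
    Each batch is a (state_tensor, action_tensor) tuple."""
    f_logits = disc(*failure_batch)
    a_logits = disc(*agent_batch)
    bce = nn.functional.binary_cross_entropy_with_logits
    return (bce(f_logits, torch.ones_like(f_logits)) +
            bce(a_logits, torch.zeros_like(a_logits)))

def densified_reward(disc, state, action, sparse_reward, beta: float = 1.0):
    """Augment the sparse environment reward with a dissimilarity-to-failure term:
    the more the current behavior resembles failure, the larger the penalty."""
    with torch.no_grad():
        p_fail = torch.sigmoid(disc(state, action))
    return sparse_reward + beta * torch.log(1.0 - p_fail + 1e-8)

def advantage(dense_r, value_s, value_next_s, gamma: float = 0.99):
    """One-step advantage A(s, a) = r_dense + gamma * V(s') - V(s), used here to
    steer the policy's search direction away from failure-like behavior."""
    return dense_r + gamma * value_next_s - value_s
```

The log(1 - p_fail) shaping term mirrors GAIL-style discriminator rewards, except that the failure set plays the role of the positive class, so behavior resembling failure is penalized rather than imitated; whether the letter uses exactly this functional form is an assumption of this sketch.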