{"title":"基于稀疏奖励致密化的失败强化学习鲁棒人机团队","authors":"Mingkang Wu;Yongcan Cao","doi":"10.1109/LCSYS.2025.3591199","DOIUrl":null,"url":null,"abstract":"Learning control policies in sparse reward environments is a challenging task for many robotic control tasks. The existing studies focus on designing reinforcement learning algorithms that take human inputs in the form of demonstrations such that control policies are learned via uncovering the value of these demonstrations. One typical approach is to learn an inherent reward function that can explain why demonstrations are better than other randomly generated samples. Albeit powerful, the use of human demonstrations is typically costly and difficult to collect, indicating the lack of robustness in these studies. To enhance robustness, we here propose to use failed experiences, namely, failure, due to the easiness of obtaining failure dataset, requiring only common sense rather than domain knowledge needed to generate expert demonstrations. In particular, this letter proposes a new reward densification technique that trains a discriminator to evaluate the similarity between the agent’s current behavior and failure dataset provided by humans. This reward densification technique provides an effective mechanism to obtain state-action values for environments with sparse rewards, via quantifying their (dis)similarity with failure. Additionally, the value of the current behavior, formulated as advantage function, is employed based on the densified reward to refine the control policy’s search direction. We finally conduct several experiments to demonstrate the effectiveness of the proposed approach by comparing with state-of-art methods.","PeriodicalId":37235,"journal":{"name":"IEEE Control Systems Letters","volume":"9 ","pages":"2315-2320"},"PeriodicalIF":2.0000,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Robust Human-Machine Teaming Through Reinforcement Learning From Failure via Sparse Reward Densification\",\"authors\":\"Mingkang Wu;Yongcan Cao\",\"doi\":\"10.1109/LCSYS.2025.3591199\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Learning control policies in sparse reward environments is a challenging task for many robotic control tasks. The existing studies focus on designing reinforcement learning algorithms that take human inputs in the form of demonstrations such that control policies are learned via uncovering the value of these demonstrations. One typical approach is to learn an inherent reward function that can explain why demonstrations are better than other randomly generated samples. Albeit powerful, the use of human demonstrations is typically costly and difficult to collect, indicating the lack of robustness in these studies. To enhance robustness, we here propose to use failed experiences, namely, failure, due to the easiness of obtaining failure dataset, requiring only common sense rather than domain knowledge needed to generate expert demonstrations. In particular, this letter proposes a new reward densification technique that trains a discriminator to evaluate the similarity between the agent’s current behavior and failure dataset provided by humans. This reward densification technique provides an effective mechanism to obtain state-action values for environments with sparse rewards, via quantifying their (dis)similarity with failure. 
Additionally, the value of the current behavior, formulated as advantage function, is employed based on the densified reward to refine the control policy’s search direction. We finally conduct several experiments to demonstrate the effectiveness of the proposed approach by comparing with state-of-art methods.\",\"PeriodicalId\":37235,\"journal\":{\"name\":\"IEEE Control Systems Letters\",\"volume\":\"9 \",\"pages\":\"2315-2320\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-07-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Control Systems Letters\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11087555/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Control Systems Letters","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11087555/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Learning control policies in sparse-reward environments is challenging for many robotic control tasks. Existing studies focus on designing reinforcement learning algorithms that take human input in the form of demonstrations, so that control policies are learned by uncovering the value of these demonstrations. One typical approach is to learn an inherent reward function that explains why demonstrations are better than randomly generated samples. Albeit powerful, human demonstrations are typically costly and difficult to collect, which limits the robustness of these methods. To enhance robustness, we propose to use failed experiences, namely failures, because a failure dataset is easy to obtain: it requires only common sense rather than the domain knowledge needed to generate expert demonstrations. In particular, this letter proposes a new reward densification technique that trains a discriminator to evaluate the similarity between the agent's current behavior and a failure dataset provided by humans. This technique provides an effective mechanism for obtaining state-action values in environments with sparse rewards by quantifying the (dis)similarity of state-action pairs with failure. Additionally, the value of the current behavior, formulated as an advantage function, is computed from the densified reward to refine the control policy's search direction. Finally, we conduct several experiments that demonstrate the effectiveness of the proposed approach by comparing it with state-of-the-art methods.
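The abstract describes the mechanism only at a high level. The following Python sketch illustrates one plausible way a discriminator-based densification from failure data could be wired up; it is a minimal illustration under assumed interfaces (the names FailureDiscriminator, densified_reward, and advantage are hypothetical), not the letter's actual implementation, and the paper should be consulted for the exact reward and advantage formulations.

```python
# Hypothetical sketch: densifying a sparse reward with a discriminator trained
# to recognize human-provided failure behavior. Names and formulas are illustrative.
import torch
import torch.nn as nn

class FailureDiscriminator(nn.Module):
    """Scores how similar a (state, action) pair is to the failure dataset."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: high = looks like failure
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def discriminator_loss(disc, failure_batch, agent_batch):
    """Binary cross-entropy: failure data labeled 1, current agent behavior labeled 0.
    Each batch is a (state_tensor, action_tensor) tuple."""
    f_logits = disc(*failure_batch)
    a_logits = disc(*agent_batch)
    bce = nn.functional.binary_cross_entropy_with_logits
    return (bce(f_logits, torch.ones_like(f_logits)) +
            bce(a_logits, torch.zeros_like(a_logits)))

def densified_reward(disc, state, action, sparse_reward, beta: float = 1.0):
    """Augment the sparse environment reward with a dissimilarity-to-failure term:
    the more the current behavior resembles failure, the larger the penalty."""
    with torch.no_grad():
        p_fail = torch.sigmoid(disc(state, action))
    return sparse_reward + beta * torch.log(1.0 - p_fail + 1e-8)

def advantage(dense_r, value_s, value_next_s, gamma: float = 0.99):
    """One-step advantage A(s, a) = r_dense + gamma * V(s') - V(s), used here to
    steer the policy's search direction away from failure-like behavior."""
    return dense_r + gamma * value_next_s - value_s
```

The log(1 - p_fail) shaping term mirrors GAIL-style discriminator rewards, except that the failure set plays the role of the positive class, so behavior resembling failure is penalized rather than imitated; whether the letter uses exactly this functional form is an assumption of this sketch.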