CESDQL: Communicative experience-sharing deep Q-learning for scalability in multi-robot collaboration with sparse reward

IF 7.6 · Region 1: Computer Science · JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Knowledge-Based Systems Pub Date : 2024-11-12 DOI:10.1016/j.knosys.2024.112714

Muhammad Naveed Abbas, Paul Liston, Brian Lee, Yuansong Qiao

Knowledge-Based Systems, Volume 306, Article 112714. Journal Article. https://www.sciencedirect.com/science/article/pii/S0950705124013480

Citations: 0

Abstract



Owing to the massive transformation in industrial processes and logistics, warehouses are also undergoing advanced automation. The application of Autonomous Mobile Robots (a.k.a. multi-robots) is one of the important elements of overall warehousing automation. The autonomous collaborative behaviour of the multi-robots can be treated as a control task and can therefore be optimised using multi-agent reinforcement learning (MARL). Consequently, an autonomous warehouse is represented by a MARL environment. A MARL environment replicating an autonomous warehouse makes exploration challenging because rewards are sparse, which leads to inefficient collaboration. This challenge worsens as the number of robots and the grid size increase, i.e., it is a scalability problem. This research proposes Communicative Experience-Sharing Deep Q-Learning (CESDQL), a novel hybrid multi-robot communicative framework based on Q-learning, designed for scalability in MARL collaboration with sparse rewards, where difficult exploration hinders collaboration. CESDQL combines experience-sharing, through collective sampling from the Experience (Replay) buffer, with communication, through the Communicative Deep Recurrent Q-Network (CommDRQN), a Q-function approximator. Empirical evaluation of CESDQL in a variety of collaborative scenarios establishes that CESDQL outperforms the baselines in terms of convergence and stable learning. Overall, CESDQL achieves 5%, 69%, 60%, 211%, 171%, 3.8% and 10% more final cumulative training returns than the closest performing baseline, by scenario, and 27%, 10.33% and 573% more final average training returns than the closest performing baseline in the large-scale scenarios.
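The abstract names two ingredients: collective sampling from a shared experience (replay) buffer, and a Q-learning update. The following is a minimal sketch of the first idea only, under assumptions that do not come from the paper: the class name `SharedReplayBuffer`, the `Transition` fields, uniform sampling, and the discount `gamma` are all hypothetical, and CommDRQN (the recurrent, communicating Q-network) is not modelled at all. It is meant to illustrate why pooling helps under sparse reward, not to reproduce the authors' implementation.

```python
import random
from collections import deque
from typing import NamedTuple


class Transition(NamedTuple):
    """One step of experience from one robot (field names are assumptions)."""
    agent_id: int
    obs: tuple
    action: int
    reward: float
    next_obs: tuple
    done: bool


class SharedReplayBuffer:
    """A single buffer pooled across all robots (hypothetical sketch)."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        # A bounded deque drops the oldest transition once capacity is reached.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int) -> list:
        # Uniform sampling over the pooled transitions, regardless of which
        # robot produced them (uniformity is an assumption of this sketch).
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


def td_target(reward: float, next_q_values: list, done: bool,
              gamma: float = 0.99) -> float:
    """Standard one-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Three robots act in a sparse-reward task; only robot 2 ever sees a reward.
buffer = SharedReplayBuffer(capacity=100)
for step in range(10):
    for agent_id in range(3):
        rewarded = agent_id == 2 and step == 9
        buffer.push(Transition(agent_id, (step,), 0,
                               1.0 if rewarded else 0.0, (step + 1,), rewarded))

# Every robot trains on batches drawn from the same pool, so robots 0 and 1
# can also learn from robot 2's rare rewarded transition.
batch = buffer.sample(8)
targets = [td_target(t.reward, [0.0, 0.0], t.done) for t in batch]
```

With per-robot buffers, the two robots that never reach the goal would see only zero-reward transitions; pooling is one plausible reading of how "collective sampling" eases exploration as the number of robots grows.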


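Source journal

Knowledge-Based Systems (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 14.80
Self-citation rate: 12.50%
Articles published: 1245
Review time: 7.8 months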

About the journal: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.