Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques

Natalia Zhang, Xinqi Wang, Qiwen Cui, Runlong Zhou, Sham M. Kakade, Simon S. Du

arXiv - CS - Multiagent Systems, arXiv:2409.00717, published 2024-09-01
Citations: 0
Abstract
We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validation. We define the task as identifying a Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes upper complexity bounds for learning Nash equilibria in MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward-learning outcomes. (2) We use imitation learning to approximate the reference policy, ensuring stability and effectiveness in training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
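
The abstract describes the two algorithmic techniques only at a high level. As a rough illustration of the kind of machinery involved, the sketch below combines a Bradley-Terry preference loss with an MSE regularizer along the time axis, and estimates a reference policy by tabular behavior cloning. The exact loss, the per-trajectory mean as the regularization target, and the tabular estimator are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def reward_learning_loss(r_chosen, r_rejected, reg_coef=0.1):
    """Preference loss on trajectory returns plus a time-axis MSE regularizer.

    r_chosen, r_rejected: per-step rewards of the preferred / rejected
    trajectory, each an array of shape (T,). The regularizer penalizes each
    step's deviation from the trajectory mean, pushing the learned reward
    toward a more uniform distribution over time (an assumed formulation).
    """
    ret_c, ret_r = r_chosen.sum(), r_rejected.sum()
    # Bradley-Terry term: -log sigmoid(return_chosen - return_rejected)
    bt_loss = np.log1p(np.exp(-(ret_c - ret_r)))
    # MSE along the time axis for both trajectories
    reg = sum(np.mean((r - r.mean()) ** 2) for r in (r_chosen, r_rejected))
    return bt_loss + reg_coef * reg

def bc_reference_policy(states, actions, n_actions):
    """Tabular behavior cloning: the reference policy in each dataset state
    is the empirical action distribution observed there (illustrative only)."""
    counts = {}
    for s, a in zip(states, actions):
        counts.setdefault(s, np.zeros(n_actions))[a] += 1
    return {s: c / c.sum() for s, c in counts.items()}
```

Two trajectories with the same return incur different losses under this sketch: a uniformly distributed reward sequence is penalized less than a spiky one, which is the intended effect of the time-axis regularization.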