Active preference-based Gaussian process regression for reward learning and optimization

IF 5 | Zone 1, Computer Science | JCR Q1, Robotics

International Journal of Robotics Research | Pub Date: 2023-11-07 | DOI: 10.1177/02783649231208729

Erdem Bıyık, Nicolas Huynh, Mykel J. Kochenderfer, Dorsa Sadigh


Citations: 0

Abstract



Active preference-based Gaussian process regression for reward learning and optimization

Designing reward functions is a difficult task in AI and robotics. The complex task of directly specifying all the desirable behaviors a robot needs to optimize often proves challenging for humans. A popular solution is to learn reward functions using expert demonstrations. This approach, however, is fraught with many challenges. Some methods require heavily structured models, for example, reward functions that are linear in some predefined set of features, while others adopt less structured reward functions that may necessitate tremendous amounts of data. Moreover, it is difficult for humans to provide demonstrations on robots with high degrees of freedom, or even to quantify reward values for given trajectories. To address these challenges, we present a preference-based learning approach, where human feedback is in the form of comparisons between trajectories. We do not assume highly constrained structures on the reward function. Instead, we employ a Gaussian process to model the reward function and propose a mathematical formulation to actively fit the model using only human preferences. Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework. We further analyze our algorithm in comparison to several baselines on reward optimization, where the goal is to find the optimal robot trajectory in a data-efficient way instead of learning the reward function for every possible trajectory. Our results in three different simulation experiments and a user study show that our approach can efficiently learn expressive reward functions for robotic tasks, and outperform the baselines in both reward learning and reward optimization.
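To make the idea of fitting a Gaussian process reward from pairwise comparisons concrete, here is a minimal sketch. It is not the authors' formulation: the RBF kernel, the logistic (Bradley-Terry) preference likelihood, the 1-D trajectory features, the plain gradient-ascent MAP fit, and every name below (`rbf_kernel`, `fit_map_rewards`, `prefs`) are assumptions chosen for illustration. It also omits the active query selection the abstract mentions.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    # Assumed squared-exponential kernel over 1-D trajectory features.
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def fit_map_rewards(x, prefs, steps=2000, lr=0.05):
    """MAP reward values under a GP prior and a logistic preference likelihood.

    prefs is a list of (winner, loser) index pairs into x.
    Whitened parameterization: f = L z with z ~ N(0, I) and K = L L^T.
    """
    n = len(x)
    K = rbf_kernel(x) + 1e-6 * np.eye(n)  # small jitter for a stable Cholesky
    L = np.linalg.cholesky(K)
    z = np.zeros(n)
    winners = np.array([i for i, _ in prefs])
    losers = np.array([j for _, j in prefs])
    for _ in range(steps):
        f = L @ z
        diff = f[winners] - f[losers]
        w = 1.0 / (1.0 + np.exp(diff))  # equals 1 - sigmoid(diff)
        g = np.zeros(n)                 # gradient of the log-likelihood w.r.t. f
        np.add.at(g, winners, w)
        np.add.at(g, losers, -w)
        z += lr * (L.T @ g - z)         # ascent step on the log-posterior in z
    return L @ z

# Five hypothetical trajectories; each is preferred over its left neighbour.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
prefs = [(1, 0), (2, 1), (3, 2), (4, 3)]
f_map = fit_map_rewards(x, prefs)
print(np.round(f_map, 3))
```

In this toy setup the fitted rewards increase from the first trajectory to the last, which matches the comparisons supplied. An active variant would additionally choose which pair to ask about next, for example by an information-based criterion over the posterior.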


Source journal

International Journal of Robotics Research (Engineering and Technology: Robotics)

CiteScore: 22.20

Self-citation rate: 0.00%

Articles published: (not listed)

Review time: 6-12 weeks

About the journal: The International Journal of Robotics Research (IJRR) has been a leading peer-reviewed publication in the field for over two decades. It holds the distinction of being the first scholarly journal dedicated to robotics research. IJRR presents cutting-edge and thought-provoking original research papers, articles, and reviews that delve into groundbreaking trends, technical advancements, and theoretical developments in robotics. Renowned scholars and practitioners contribute to its content, offering their expertise and insights. This journal covers a wide range of topics, going beyond narrow technical advancements to encompass various aspects of robotics. The primary aim of IJRR is to publish work that has lasting value for the scientific and technological advancement of the field. Only original, robust, and practical research that can serve as a foundation for further progress is considered for publication. The focus is on producing content that will remain valuable and relevant over time. In summary, IJRR stands as a prestigious publication that drives innovation and knowledge in robotics research.