Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
Judy Hanwen Shen, Archit Sharma, Jun Qin
arXiv - CS - Artificial Intelligence, 2024-09-15
DOI: arxiv-2409.09603 (https://doi.org/arxiv-2409.09603)
Citation count: 0
Abstract
The goal of aligning language models to human preferences requires data that
reveal these preferences. Ideally, time and money can be spent carefully
collecting and tailoring bespoke preference data to each downstream
application. However, in practice, a select few publicly available preference
datasets are often used to train reward models for reinforcement learning from
human feedback (RLHF). While new preference datasets are being introduced with
increasing frequency, there are currently no efforts to measure and
compare these datasets. In this paper, we systematically study preference
datasets through three perspectives: scale, label noise, and information
content. We propose specific metrics for each of these perspectives and uncover
different axes of comparison for a better understanding of preference datasets.
Our work is a first step towards a data-centric approach to alignment by
providing perspectives that aid in training efficiency and iterative data
collection for RLHF.