{"title":"From Lists to Emojis: How Format Bias Affects Model Alignment","authors":"Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, Tong Zhang","doi":"arxiv-2409.11704","DOIUrl":null,"url":null,"abstract":"In this paper, we study format biases in reinforcement learning from human\nfeedback (RLHF). We observe that many widely-used preference models, including\nhuman evaluators, GPT-4, and top-ranking models on the RewardBench benchmark,\nexhibit strong biases towards specific format patterns, such as lists, links,\nbold text, and emojis. Furthermore, large language models (LLMs) can exploit\nthese biases to achieve higher rankings on popular benchmarks like AlpacaEval\nand LMSYS Chatbot Arena. One notable example of this is verbosity bias, where\ncurrent preference models favor longer responses that appear more\ncomprehensive, even when their quality is equal to or lower than shorter,\ncompeting responses. However, format biases beyond verbosity remain largely\nunderexplored in the literature. In this work, we extend the study of biases in\npreference learning beyond the commonly recognized length bias, offering a\ncomprehensive analysis of a wider range of format biases. Additionally, we show\nthat with a small amount of biased data (less than 1%), we can inject\nsignificant bias into the reward model. Moreover, these format biases can also\nbe easily exploited by downstream alignment algorithms, such as best-of-n\nsampling and online iterative DPO, as it is usually easier to manipulate the\nformat than to improve the quality of responses. Our findings emphasize the\nneed to disentangle format and content both for designing alignment algorithms\nand evaluating models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"24 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11704","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is no better than that of shorter competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length bias, offering a comprehensive analysis of a wider range of format biases.
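
To make the patterns concrete, here is a minimal sketch (not the authors' code) of a probe that counts the format features named above in a model response; the regular expressions and feature names are illustrative assumptions.

```python
import re

# Hypothetical probe for the surface-format patterns discussed in the abstract
# (lists, links, bold text, emojis). Patterns are illustrative, not the paper's.
BOLD = re.compile(r"\*\*[^*]+\*\*")
LINK = re.compile(r"\[[^\]]+\]\([^)]+\)")
LIST_ITEM = re.compile(r"^\s*(?:[-*]|\d+\.)\s+", re.MULTILINE)
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def format_features(response: str) -> dict:
    """Count surface-format patterns in a single response."""
    return {
        "bold": len(BOLD.findall(response)),
        "links": len(LINK.findall(response)),
        "list_items": len(LIST_ITEM.findall(response)),
        "emojis": len(EMOJI.findall(response)),
        "length_words": len(response.split()),
    }

# Comparing these counts between the chosen and rejected response of each
# preference pair gives a rough estimate of how often the "formatted" side wins.
```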
Additionally, we show that with a small amount of biased data (less than 1%), we can inject significant bias into the reward model.
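
One simple way to picture the injection setup, offered purely as an assumed sketch rather than the paper's actual procedure, is to relabel a small fraction of preference pairs so that the response carrying a targeted pattern is always marked as chosen.

```python
import random

def inject_format_bias(pairs, pattern="**", fraction=0.01, seed=0):
    """Relabel a small fraction of (chosen, rejected) pairs in favor of a pattern.

    Assumed schema: `pairs` is a list of (chosen_text, rejected_text) tuples.
    Flipping labels is one illustrative injection mechanism, not necessarily
    the authors' construction.
    """
    rng = random.Random(seed)
    biased = []
    for chosen, rejected in pairs:
        if rng.random() < fraction and pattern in rejected and pattern not in chosen:
            # Prefer the formatted (but not necessarily better) response.
            chosen, rejected = rejected, chosen
        biased.append((chosen, rejected))
    return biased
```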
Moreover, these format biases can be easily exploited by downstream alignment algorithms, such as best-of-n sampling and online iterative DPO, since it is usually easier to manipulate the format of a response than to improve its quality. Our findings emphasize the need to disentangle format from content, both when designing alignment algorithms and when evaluating models.
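
To see how best-of-n sampling can exploit such a bias, here is a minimal sketch; `generate` and `reward_model` are hypothetical stand-ins for a policy sampler and a (possibly format-biased) reward model, not APIs from the paper.

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Sample n candidates and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]

# If the reward model over-scores lists, bold text, or emojis, best-of-n will
# preferentially return the most heavily formatted candidate, even when a
# plainer candidate answers the prompt better.
```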