Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback.

IF 3.4 | CAS Tier 2 (Philosophy) | JCR Q1 (Ethics)
Ethics and Information Technology. Pub Date: 2025-01-01. Epub Date: 2025-06-04. DOI: 10.1007/s10676-025-09837-2
Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe
{"title":"Helpful, harmless, honest? Sociotechnical limits of AI alignment and safety through Reinforcement Learning from Human Feedback.","authors":"Adam Dahlgren Lindström, Leila Methnani, Lea Krause, Petter Ericson, Íñigo Martínez de Rituerto de Troya, Dimitri Coelho Mollo, Roel Dobbe","doi":"10.1007/s10676-025-09837-2","DOIUrl":null,"url":null,"abstract":"<p><p>This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLHF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics, and contributing to AI safety. We highlight tensions inherent in the goals of RLHF, as captured in the HHH principle (helpful, harmless and honest). In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLHF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We offer an alternative vision for AI safety and ethics which positions RLHF approaches within a broader context of comprehensive design across institutions, processes and technological systems, and suggest the establishment of AI safety as a sociotechnical discipline that is open to the normative and political dimensions of artificial intelligence.</p>","PeriodicalId":51495,"journal":{"name":"Ethics and Information Technology","volume":"27 2","pages":"28"},"PeriodicalIF":3.4000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12137480/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ethics and Information Technology","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1007/s10676-025-09837-2","RegionNum":2,"RegionCategory":"哲学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/6/4 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"ETHICS","Score":null,"Total":0}
Citations: 0

Abstract

This paper critically evaluates attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLHF techniques, revealing significant limitations in their ability to capture the complexities of human ethics and to contribute to AI safety. We highlight tensions inherent in the goals of RLHF, as captured in the HHH principle (helpful, harmless, and honest). In addition, we discuss ethically relevant issues that tend to be neglected in discussions about alignment and RLHF, among which are the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We offer an alternative vision for AI safety and ethics which positions RLHF approaches within a broader context of comprehensive design across institutions, processes, and technological systems, and we suggest the establishment of AI safety as a sociotechnical discipline that is open to the normative and political dimensions of artificial intelligence.
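For readers unfamiliar with the mechanism under critique, the sketch below (not from the paper; all names and values are illustrative) shows the preference-modelling step at the core of RLHF: a reward model is trained on human comparisons of paired responses with a Bradley-Terry style loss, and the resulting scalar reward then guides policy optimisation (typically PPO). In RLAIF, the human comparisons are replaced by AI-generated feedback.

import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the reward model to score the human-preferred
    ("chosen") response above the dispreferred ("rejected") one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with illustrative scalar rewards for three comparison pairs.
chosen = torch.tensor([1.2, 0.4, 0.9])     # rewards assigned to preferred responses
rejected = torch.tensor([0.3, 0.5, -0.1])  # rewards assigned to rejected responses
print(preference_loss(chosen, rejected))   # smaller when chosen consistently outscores rejected

The scalar reward trained this way is the only channel through which the helpful, harmless, and honest goals enter policy optimisation, which is the reduction the paper's sociotechnical critique examines.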

Source journal
CiteScore: 8.20
Self-citation rate: 5.60%
Articles published: 46
Journal description: Ethics and Information Technology is a peer-reviewed journal dedicated to advancing the dialogue between moral philosophy and the field of information and communication technology (ICT). The journal aims to foster and promote reflection and analysis intended to make a constructive contribution to answering the ethical, social, and political questions associated with the adoption, use, and development of ICT. Also within the scope of the journal are conceptual analysis and discussion of ethical ICT issues that arise in the context of technology assessment, cultural studies, public policy analysis and public administration, cognitive science, social and anthropological studies of technology, mass communication, and legal studies.