The Inadequacy of Reinforcement Learning From Human Feedback—Radicalizing Large Language Models via Semantic Vulnerabilities

IF 5 3区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Timothy R. McIntosh;Teo Susnjak;Tong Liu;Paul Watters;Malka N. Halgamuge
{"title":"The Inadequacy of Reinforcement Learning From Human Feedback—Radicalizing Large Language Models via Semantic Vulnerabilities","authors":"Timothy R. McIntosh;Teo Susnjak;Tong Liu;Paul Watters;Malka N. Halgamuge","doi":"10.1109/TCDS.2024.3377445","DOIUrl":null,"url":null,"abstract":"This study is an empirical investigation into the semantic vulnerabilities of four popular pretrained commercial large language models (LLMs) to ideological manipulation. Using tactics reminiscent of human semantic conditioning in psychology, we have induced and assessed ideological misalignments and their retention in four commercial pretrained LLMs, in response to 30 controversial questions that spanned a broad ideological and social spectrum, encompassing both extreme left- and right-wing viewpoints. Such semantic vulnerabilities arise due to fundamental limitations in LLMs’ capability to comprehend detailed linguistic variations, making them susceptible to ideological manipulation through targeted semantic exploits. We observed reinforcement learning from human feedback (RLHF) in effect to LLM initial answers, but highlighted the limitations of RLHF in two aspects: 1) its inability to fully mitigate the impact of ideological conditioning prompts, leading to partial alleviation of LLM semantic vulnerabilities; and 2) its inadequacy in representing a diverse set of “human values,” often reflecting the predefined values of certain groups controlling the LLMs. Our findings have provided empirical evidence of semantic vulnerabilities inherent in current LLMs, challenged both the robustness and the adequacy of RLHF as a mainstream method for aligning LLMs with human values, and underscored the need for a multidisciplinary approach in developing ethical and resilient artificial intelligence (AI).","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":"16 4","pages":"1561-1574"},"PeriodicalIF":5.0000,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cognitive and Developmental Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10474163/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

This study is an empirical investigation into the semantic vulnerabilities of four popular pretrained commercial large language models (LLMs) to ideological manipulation. Using tactics reminiscent of human semantic conditioning in psychology, we have induced and assessed ideological misalignments and their retention in four commercial pretrained LLMs, in response to 30 controversial questions that spanned a broad ideological and social spectrum, encompassing both extreme left- and right-wing viewpoints. Such semantic vulnerabilities arise due to fundamental limitations in LLMs’ capability to comprehend detailed linguistic variations, making them susceptible to ideological manipulation through targeted semantic exploits. We observed reinforcement learning from human feedback (RLHF) in effect to LLM initial answers, but highlighted the limitations of RLHF in two aspects: 1) its inability to fully mitigate the impact of ideological conditioning prompts, leading to partial alleviation of LLM semantic vulnerabilities; and 2) its inadequacy in representing a diverse set of “human values,” often reflecting the predefined values of certain groups controlling the LLMs. Our findings have provided empirical evidence of semantic vulnerabilities inherent in current LLMs, challenged both the robustness and the adequacy of RLHF as a mainstream method for aligning LLMs with human values, and underscored the need for a multidisciplinary approach in developing ethical and resilient artificial intelligence (AI).
从人类反馈中强化学习的不足--通过语义漏洞激化大型语言模型
本研究是对四种流行的预训练商业大语言模型(LLM)在意识形态操纵下的语义脆弱性进行的实证调查。我们采用与心理学中的人类语义条件反射类似的策略,诱导并评估了四种商业预训练大语言模型在回答 30 个有争议的问题时的意识形态错位及其保留情况,这些问题涉及广泛的意识形态和社会范围,包括极左和极右观点。这种语义漏洞的产生是由于 LLMs 理解语言细节变化的能力存在根本性的限制,这使得它们很容易被有针对性的语义利用来进行意识形态操纵。我们观察到从人类反馈中强化学习(RLHF)对 LLM 初始答案的影响,但强调了 RLHF 在两个方面的局限性:1)它无法完全缓解意识形态条件提示的影响,导致部分减轻了 LLM 的语义漏洞;以及 2)它无法充分体现多样化的 "人类价值观",往往反映的是控制 LLM 的某些群体的预定义价值观。我们的研究结果提供了当前 LLM 固有语义漏洞的实证证据,对 RLHF 作为使 LLM 符合人类价值观的主流方法的稳健性和适当性提出了质疑,并强调了在开发符合伦理和具有弹性的人工智能(AI)时采用多学科方法的必要性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
7.20
自引率
10.00%
发文量
170
期刊介绍: The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信