Robust Conversational Agents against Imperceptible Toxicity Triggers

Ninareh Mehrabi, Ahmad Beirami, Fred Morstatter, Aram Galstyan
{"title":"Robust Conversational Agents against Imperceptible Toxicity Triggers","authors":"Ninareh Mehrabi, Ahmad Beirami, Fred Morstatter, A. Galstyan","doi":"10.48550/arXiv.2205.02392","DOIUrl":null,"url":null,"abstract":"Warning: this paper contains content that maybe offensive or upsetting.Recent research in Natural Language Processing (NLP) has advanced the development of various toxicity detection models with the intention of identifying and mitigating toxic language from existing systems. Despite the abundance of research in this area, less attention has been given to adversarial attacks that force the system to generate toxic language and the defense against them. Existing work to generate such attacks is either based on human-generated attacks which is costly and not scalable or, in case of automatic attacks, the attack vector does not conform to human-like language, which can be detected using a language model loss. In this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language. We then propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow. Through automatic and human evaluations, we show that our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers while the generated language fits the conversation in terms of coherency and relevancy. Lastly, we establish the generalizability of such a defense mechanism on language generation models beyond conversational agents.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"North American Chapter of the Association for Computational Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2205.02392","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

Abstract

Warning: this paper contains content that may be offensive or upsetting. Recent research in Natural Language Processing (NLP) has advanced the development of various toxicity detection models intended to identify and mitigate toxic language in existing systems. Despite the abundance of research in this area, less attention has been given to adversarial attacks that force a system to generate toxic language, and to defenses against them. Existing work on generating such attacks either relies on human-crafted attacks, which are costly and do not scale, or, in the case of automatic attacks, produces attack vectors that do not conform to human-like language and can therefore be detected using a language model loss. In this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while being effective and scalable, i.e., they can automatically trigger the system into generating toxic language. We then propose a defense mechanism against such attacks that not only mitigates the attack but also attempts to maintain the conversational flow. Through automatic and human evaluations, we show that our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers, while the generated language fits the conversation in terms of coherency and relevancy. Lastly, we establish the generalizability of such a defense mechanism to language generation models beyond conversational agents.
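The abstract notes that earlier automatic attacks produce triggers that do not resemble natural language and can therefore be flagged by their language model loss. A minimal sketch of that perplexity-based check follows; the choice of GPT-2 and the cutoff value are illustrative assumptions, not the paper's exact setup.

```python
# Flagging unnatural trigger text by its language-model loss (perplexity).
# Model choice (gpt2) and PPL_THRESHOLD are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; high values suggest unnatural text."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss over the sequence.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

PPL_THRESHOLD = 200.0  # hypothetical cutoff
# A gibberish, universal-trigger-style string scores far above fluent text,
# so a simple threshold catches it; the paper's imperceptible attacks are
# designed to stay below such a threshold.
print(perplexity("zoning tapping fiennes describing incent"))       # high
print(perplexity("What did you think of the movie last night?"))    # low
```

Similarly, one plausible shape for the defense described in the abstract is to sample candidate replies and return the first one a toxicity classifier accepts, so the agent avoids toxic output while keeping the conversation going. The generator and classifier models, the threshold, the retry budget, and the fallback line below are all assumptions for illustration, not the paper's actual mechanism.

```python
# Sketch of a reject-and-resample defense: keep sampling replies until one
# passes a toxicity check, then fall back to a neutral deflection.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/DialoGPT-small")
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def safe_reply(context: str, max_tries: int = 5, threshold: float = 0.5) -> str:
    for _ in range(max_tries):
        out = generator(context, max_new_tokens=40, do_sample=True)
        candidate = out[0]["generated_text"][len(context):].strip()  # continuation only
        scores = toxicity(candidate, top_k=None)  # one score per toxicity label
        if max(s["score"] for s in scores) < threshold:
            return candidate
    # Every sampled reply was flagged; deflect rather than go silent,
    # which is one way to "maintain the conversational flow".
    return "I'd rather not get into that. What else is on your mind?"
```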