All You Need is "Love": Evading Hate Speech Detection

Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, N. Asokan
{"title":"All You Need is \"Love\": Evading Hate Speech Detection","authors":"Tommi Gröndahl, Luca Pajola, Mika Juuti, M. Conti, N. Asokan","doi":"10.1145/3270101.3270103","DOIUrl":null,"url":null,"abstract":"With the spread of social networks and their unfortunate use for hate speech, automatic detection of the latter has become a pressing problem. In this paper, we reproduce seven state-of-the-art hate speech detection models from prior work, and show that they perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech. A combination of these methods is also effective against Google Perspective - a cutting-edge solution from industry. Our experiments demonstrate that adversarial training does not completely mitigate the attacks, and using character-level features makes the models systematically more attack-resistant than using word-level features.","PeriodicalId":132293,"journal":{"name":"Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security","volume":"349 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"186","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3270101.3270103","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 186

Abstract

With the spread of social networks and their unfortunate use for hate speech, automatic detection of the latter has become a pressing problem. In this paper, we reproduce seven state-of-the-art hate speech detection models from prior work, and show that they perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech. A combination of these methods is also effective against Google Perspective - a cutting-edge solution from industry. Our experiments demonstrate that adversarial training does not completely mitigate the attacks, and using character-level features makes the models systematically more attack-resistant than using word-level features.
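The three perturbations named in the abstract are simple, automatable string transformations. As a rough sketch (an illustration written for this page, not the authors' released code; the function names, the adjacent-character-swap typo model, and the choice of "love" as the appended innocuous word are assumptions), they could look like:

```python
# Illustrative sketch of the three evasion transformations summarized in the
# abstract. This is not the authors' implementation; function names, the
# adjacent-character-swap typo model, and the appended word "love" (echoing
# the paper's title) are assumptions made for demonstration.
import random


def insert_typo(word: str) -> str:
    """Swap two adjacent characters to simulate an automatically inserted typo."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def change_word_boundaries(text: str) -> str:
    """Remove whitespace so a word-level tokenizer sees one unknown token."""
    return text.replace(" ", "")


def add_innocuous_words(text: str, padding: str = "love") -> str:
    """Append an unrelated, benign word to dilute the hateful signal."""
    return f"{text} {padding}"


def evade(text: str) -> str:
    """Combine all three perturbations into a single adversarial rewrite."""
    typoed = " ".join(insert_typo(w) for w in text.split())
    return add_innocuous_words(change_word_boundaries(typoed))
```

Note that removing word boundaries collapses a sentence into a single out-of-vocabulary token for a word-level model, which is consistent with the abstract's observation that character-level features make models systematically more attack-resistant.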