All You Need is "Love": Evading Hate Speech Detection

Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, N. Asokan
{"title":"All You Need is \"Love\": Evading Hate Speech Detection","authors":"Tommi Gröndahl, Luca Pajola, Mika Juuti, M. Conti, N. Asokan","doi":"10.1145/3270101.3270103","DOIUrl":null,"url":null,"abstract":"With the spread of social networks and their unfortunate use for hate speech, automatic detection of the latter has become a pressing problem. In this paper, we reproduce seven state-of-the-art hate speech detection models from prior work, and show that they perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech. A combination of these methods is also effective against Google Perspective - a cutting-edge solution from industry. Our experiments demonstrate that adversarial training does not completely mitigate the attacks, and using character-level features makes the models systematically more attack-resistant than using word-level features.","PeriodicalId":132293,"journal":{"name":"Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security","volume":"349 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"186","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3270101.3270103","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 186

Abstract

With the spread of social networks and their unfortunate use for hate speech, automatic detection of the latter has become a pressing problem. In this paper, we reproduce seven state-of-the-art hate speech detection models from prior work, and show that they perform well only when tested on the same type of data they were trained on. Based on these results, we argue that for successful hate speech detection, model architecture is less important than the type of data and labeling criteria. We further show that all proposed detection techniques are brittle against adversaries who can (automatically) insert typos, change word boundaries or add innocuous words to the original hate speech. A combination of these methods is also effective against Google Perspective - a cutting-edge solution from industry. Our experiments demonstrate that adversarial training does not completely mitigate the attacks, and using character-level features makes the models systematically more attack-resistant than using word-level features.
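The three perturbations named in the abstract are simple, automatable string transformations. As a rough sketch (an illustration written for this page, not the authors' released code; the function names, the adjacent-character-swap typo model, and the choice of "love" as the appended innocuous word are assumptions), they could look like:

```python
# Illustrative sketch of the three evasion transformations summarized in the
# abstract. This is not the authors' implementation; function names, the
# adjacent-character-swap typo model, and the appended word "love" (echoing
# the paper's title) are assumptions made for demonstration.
import random


def insert_typo(word: str) -> str:
    """Swap two adjacent characters to simulate an automatically inserted typo."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def change_word_boundaries(text: str) -> str:
    """Remove whitespace so a word-level tokenizer sees one unknown token."""
    return text.replace(" ", "")


def add_innocuous_words(text: str, padding: str = "love") -> str:
    """Append an unrelated, benign word to dilute the hateful signal."""
    return f"{text} {padding}"


def evade(text: str) -> str:
    """Combine all three perturbations into a single adversarial rewrite."""
    typoed = " ".join(insert_typo(w) for w in text.split())
    return add_innocuous_words(change_word_boundaries(typoed))
```

Note that removing word boundaries collapses a sentence into a single out-of-vocabulary token for a word-level model, which is consistent with the abstract's observation that character-level features make models systematically more attack-resistant.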