Improving Zero-Shot Cross-Lingual Hate Speech Detection with Pseudo-Label Fine-Tuning of Transformer Language Models

Haris Bin Zia, Ignacio Castro, A. Zubiaga, Gareth Tyson

International Conference on Web and Social Media · Published 2022-05-31 · DOI: 10.1609/icwsm.v16i1.19402
Citations: 11
Abstract
Hate speech has proliferated on social media platforms in recent years. While this has been the focus of many studies, most works have focused exclusively on a single language, generally English. Low-resource languages have been neglected due to the dearth of labeled resources. These languages, however, represent an important portion of the data due to the multilingual nature of social media. This work presents a novel zero-shot, cross-lingual transfer learning pipeline based on pseudo-label fine-tuning of Transformer language models for automatic hate speech detection. We apply our pipeline to benchmark datasets covering English (source) and six non-English (target) languages written in three different scripts. Our pipeline achieves an average improvement of 7.6% (in terms of macro-F1) over previous zero-shot, cross-lingual models. This demonstrates the feasibility of high-accuracy automatic hate speech detection for low-resource languages. We release our code and models at https://github.com/harisbinzia/ZeroshotCrosslingualHateSpeech.
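The core idea of pseudo-label fine-tuning can be illustrated with a minimal, self-contained sketch: train a classifier on labeled source-language data, predict labels for unlabeled target-language text, keep only confident predictions as pseudo-labels, and retrain on the union. The sketch below uses scikit-learn's `LogisticRegression` over character n-grams as a lightweight stand-in for the multilingual Transformer used in the paper; the function name, threshold, and features are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of one round of pseudo-label self-training.
# A character-n-gram logistic regression stands in for the multilingual
# Transformer; all names and the 0.8 threshold are illustrative.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def pseudo_label_fit(src_texts, src_labels, tgt_texts, threshold=0.8):
    """Train on labeled source data, pseudo-label confident unlabeled
    target examples, then retrain on the combined set."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
    X_src = vec.fit_transform(src_texts)
    clf = LogisticRegression(max_iter=1000).fit(X_src, src_labels)

    # Zero-shot step: predict labels for unlabeled target-language text.
    X_tgt = vec.transform(tgt_texts)
    probs = clf.predict_proba(X_tgt)
    conf = probs.max(axis=1)
    pseudo = clf.classes_[probs.argmax(axis=1)]

    # Keep only confident pseudo-labels and retrain on the union.
    keep = conf >= threshold
    if keep.any():
        X_all = vstack([X_src, X_tgt[keep]])
        y_all = np.concatenate([np.asarray(src_labels), pseudo[keep]])
        clf = LogisticRegression(max_iter=1000).fit(X_all, y_all)
    return vec, clf
```

In the paper's setting, the stand-in classifier would be replaced by a pretrained multilingual Transformer fine-tuned on the English source data, with pseudo-labels generated for the target-language corpus before a second round of fine-tuning.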