A transformer based multi task learning approach to multimodal hate speech detection

Natural Language Processing Journal | Pub Date: 2025-02-20 | DOI: 10.1016/j.nlp.2025.100133

Prashant Kapil, Asif Ekbal

Volume 11, Article 100133. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S2949719125000093


Abstract

Online hate speech has become a major social issue in recent years, affecting both individuals and society as a whole. Memes are an increasingly common multimodal vehicle for online hate speech. Most online memes are entertaining and harmless, but a seemingly innocent meme becomes a multimodal form of hate speech (a hateful meme) when specific types of text, images, or combinations of both are used. The spread of such harmful memes can disrupt social harmony, so it is vital to limit them on social media. Multimodal hate speech detection remains a difficult, open problem: it requires language understanding, visual perception, and multimodal reasoning together. This work advances this line of research by building a multi-task learning (MTL) based multimodal system that detects hateful memes by training on four hateful meme datasets concurrently. The MTL framework, which combines Contrastive Language-Image Pretraining (CLIP), UNiversal Image-TExt Representation Learning (UNITER), and BERT, is trained jointly so that common knowledge is transferred across the four datasets. The results show that the proposed approach outperforms unimodal and multimodal approaches on four multilingual benchmark datasets in terms of AUC-ROC, accuracy, and F1-score. Ablation studies highlight the contribution of each sub-component of the MTL model, and confusion matrices are presented as quantitative analysis.
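The abstract does not spell out how the four datasets share parameters, so the following is only a minimal sketch of one common MTL pattern (hard parameter sharing: one shared trunk plus one classification head per dataset), not the authors' architecture. The dataset names, the toy two-dimensional features standing in for fused CLIP/UNITER/BERT embeddings, and all hyperparameters are assumptions made for illustration.

```python
import math
import random

# Hypothetical task names; the abstract only says "four hateful meme datasets".
TASKS = ["dataset_a", "dataset_b", "dataset_c", "dataset_d"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class SharedTrunkMTL:
    """Hard parameter sharing: one shared tanh layer, one logistic head per task."""

    def __init__(self, in_dim, hidden_dim, tasks, seed=0):
        rng = random.Random(seed)
        # Shared weights, updated by examples from every task.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(hidden_dim)]
        # Task-specific weights, updated only by that task's examples.
        self.heads = {t: [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                      for t in tasks}

    def forward(self, x, task):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in self.W]
        p = sigmoid(sum(v * hi for v, hi in zip(self.heads[task], h)))
        return h, p

    def sgd_step(self, x, y, task, lr=0.1):
        h, p = self.forward(x, task)
        err = p - y  # gradient of binary cross-entropy w.r.t. the logit
        head = self.heads[task]
        # Gradient w.r.t. the hidden pre-activations, using the head before its update.
        grad_h = [err * v * (1.0 - hi * hi) for v, hi in zip(head, h)]
        for j in range(len(head)):
            head[j] -= lr * err * h[j]
        for j, row in enumerate(self.W):
            for i in range(len(row)):
                row[i] -= lr * grad_h[j] * x[i]


def make_toy_data(n, seed):
    """Toy stand-in for meme features: the label depends on the sign of x[0]."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.choice([-1.0, 1.0]), rng.uniform(-0.5, 0.5)]
        data.append((x, 1.0 if x[0] > 0 else 0.0))
    return data


def mean_loss(model, datasets):
    """Average binary cross-entropy over all tasks, without updating weights."""
    total, count = 0.0, 0
    for task, data in datasets.items():
        for x, y in data:
            _, p = model.forward(x, task)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
            count += 1
    return total / count


def train(model, datasets, epochs=100, lr=0.1):
    """Concurrent training: alternate between tasks, one example at a time."""
    steps = max(len(d) for d in datasets.values())
    for _ in range(epochs):
        for step in range(steps):
            for task, data in datasets.items():
                x, y = data[step % len(data)]
                model.sgd_step(x, y, task, lr)


datasets = {t: make_toy_data(20, seed=i) for i, t in enumerate(TASKS)}
model = SharedTrunkMTL(in_dim=2, hidden_dim=4, tasks=TASKS)
loss_before = mean_loss(model, datasets)
train(model, datasets)
loss_after = mean_loss(model, datasets)
print(f"mean loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In this sketch every example updates the shared trunk, while each head only sees its own dataset; that is the mechanism by which knowledge common to the tasks can be transferred. A real system would replace the toy features with embeddings from pretrained encoders.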


