{"title":"A transformer based multi task learning approach to multimodal hate speech detection","authors":"Prashant Kapil , Asif Ekbal","doi":"10.1016/j.nlp.2025.100133","DOIUrl":null,"url":null,"abstract":"<div><div>Online hate speech has become a major social issue in recent years, affecting both individuals and society as a whole. Memes are a multimodal kind of internet hate speech that is growing more common. Online memes are often entertaining and harmless. The seemingly innocent meme, on the other hand, transforms into a multimodal form of hate speech—a hateful meme—when specific types of text, graphics, or combinations of both are used. The spread of these harmful or undesirable memes has the potential to disrupt societal peace. Therefore, it is vital to limit inappropriate memes on social media. Multimodal hate speech identification is an inherently difficult and open question. It necessitates collaborative language, visual perception, and multimodal reasoning. This line of research has been progressed in this work by building a multi-task learning-based multimodal system for detecting hateful memes by training four hateful meme data sets concurrently. This MTL framework, which consists of Contrastive Language Image Pretraining (CLIP), UNiversal Image-TExt Representation Learning (UNITER), and BERT, was trained collaboratively to transfer common knowledge while simultaneously training four meme datasets. The results show that the recommended strategy outperforms unimodal and multimodal approaches on four multilingual benchmark datasets, with considerable AUC-ROC, accuracy, and F1-score. The ablation studies are undertaken to emphasise the impact of the sub-component in the MTL model. The confusion matrix is shown as quantitative analysis.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"11 ","pages":"Article 100133"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719125000093","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Online hate speech has become a major social issue in recent years, affecting both individuals and society as a whole. Memes are an increasingly common multimodal vehicle for internet hate speech. Online memes are often entertaining and harmless; a seemingly innocent meme, however, turns into a multimodal form of hate speech, a hateful meme, when particular kinds of text, imagery, or combinations of the two are used. The spread of such harmful or undesirable memes can disrupt societal peace, so it is vital to limit inappropriate memes on social media. Multimodal hate speech identification is an inherently difficult and open problem: it requires joint language understanding, visual perception, and multimodal reasoning. This work advances that line of research by building a multi-task learning (MTL) based multimodal system for detecting hateful memes, trained on four hateful meme datasets concurrently. The MTL framework, which combines Contrastive Language Image Pretraining (CLIP), UNiversal Image-TExt Representation Learning (UNITER), and BERT, is trained jointly so that common knowledge is transferred across the four meme datasets. The results show that the proposed approach outperforms unimodal and multimodal baselines on four multilingual benchmark datasets, achieving strong AUC-ROC, accuracy, and F1-scores. Ablation studies are conducted to highlight the contribution of each sub-component of the MTL model, and confusion matrices are presented as quantitative analysis.
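
For illustration, the sketch below shows one way such a multi-task multimodal classifier could be wired together: shared CLIP and BERT encoders feeding a fused representation into dataset-specific classification heads. This is a hypothetical sketch assuming PyTorch and the HuggingFace transformers library; the checkpoint names, fusion layer, and dimensions are assumptions, the UNITER branch is omitted because it has no standard transformers class, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel, CLIPModel


class MultiTaskMemeClassifier(nn.Module):
    """Shared CLIP + BERT encoders with one classification head per meme dataset.

    Hypothetical sketch of an MTL hateful-meme detector, not the paper's code.
    """

    def __init__(self, num_tasks: int = 4, hidden: int = 512):
        super().__init__()
        # Shared encoders: CLIP supplies aligned image/text embeddings,
        # BERT supplies a deeper (multilingual) text-only representation.
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        clip_dim = self.clip.config.projection_dim   # 512 for this checkpoint
        bert_dim = self.bert.config.hidden_size      # 768
        self.fuse = nn.Sequential(
            nn.Linear(2 * clip_dim + bert_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        # One binary head (hateful / not hateful) per dataset, i.e. per task.
        self.heads = nn.ModuleList(nn.Linear(hidden, 2) for _ in range(num_tasks))

    def forward(self, pixel_values, clip_input_ids, clip_attention_mask,
                bert_input_ids, bert_attention_mask, task_id: int):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=clip_input_ids,
                                          attention_mask=clip_attention_mask)
        cls = self.bert(input_ids=bert_input_ids,
                        attention_mask=bert_attention_mask).pooler_output
        fused = self.fuse(torch.cat([img, txt, cls], dim=-1))
        return self.heads[task_id](fused)   # logits for the selected dataset
```

Under this sketch, training would interleave mini-batches from the four datasets, routing each batch through its own head while the shared encoders receive gradients from every task; that shared-gradient path is how common knowledge transfers across datasets, and a UNITER encoder could be added as a third shared branch in the same way.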