{"title":"基于变压器的多任务学习方法用于多模态仇恨语音检测","authors":"Prashant Kapil , Asif Ekbal","doi":"10.1016/j.nlp.2025.100133","DOIUrl":null,"url":null,"abstract":"<div><div>Online hate speech has become a major social issue in recent years, affecting both individuals and society as a whole. Memes are a multimodal kind of internet hate speech that is growing more common. Online memes are often entertaining and harmless. The seemingly innocent meme, on the other hand, transforms into a multimodal form of hate speech—a hateful meme—when specific types of text, graphics, or combinations of both are used. The spread of these harmful or undesirable memes has the potential to disrupt societal peace. Therefore, it is vital to limit inappropriate memes on social media. Multimodal hate speech identification is an inherently difficult and open question. It necessitates collaborative language, visual perception, and multimodal reasoning. This line of research has been progressed in this work by building a multi-task learning-based multimodal system for detecting hateful memes by training four hateful meme data sets concurrently. This MTL framework, which consists of Contrastive Language Image Pretraining (CLIP), UNiversal Image-TExt Representation Learning (UNITER), and BERT, was trained collaboratively to transfer common knowledge while simultaneously training four meme datasets. The results show that the recommended strategy outperforms unimodal and multimodal approaches on four multilingual benchmark datasets, with considerable AUC-ROC, accuracy, and F1-score. The ablation studies are undertaken to emphasise the impact of the sub-component in the MTL model. The confusion matrix is shown as quantitative analysis.</div></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"11 ","pages":"Article 100133"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A transformer based multi task learning approach to multimodal hate speech detection\",\"authors\":\"Prashant Kapil , Asif Ekbal\",\"doi\":\"10.1016/j.nlp.2025.100133\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Online hate speech has become a major social issue in recent years, affecting both individuals and society as a whole. Memes are a multimodal kind of internet hate speech that is growing more common. Online memes are often entertaining and harmless. The seemingly innocent meme, on the other hand, transforms into a multimodal form of hate speech—a hateful meme—when specific types of text, graphics, or combinations of both are used. The spread of these harmful or undesirable memes has the potential to disrupt societal peace. Therefore, it is vital to limit inappropriate memes on social media. Multimodal hate speech identification is an inherently difficult and open question. It necessitates collaborative language, visual perception, and multimodal reasoning. This line of research has been progressed in this work by building a multi-task learning-based multimodal system for detecting hateful memes by training four hateful meme data sets concurrently. This MTL framework, which consists of Contrastive Language Image Pretraining (CLIP), UNiversal Image-TExt Representation Learning (UNITER), and BERT, was trained collaboratively to transfer common knowledge while simultaneously training four meme datasets. 
The results show that the recommended strategy outperforms unimodal and multimodal approaches on four multilingual benchmark datasets, with considerable AUC-ROC, accuracy, and F1-score. The ablation studies are undertaken to emphasise the impact of the sub-component in the MTL model. The confusion matrix is shown as quantitative analysis.</div></div>\",\"PeriodicalId\":100944,\"journal\":{\"name\":\"Natural Language Processing Journal\",\"volume\":\"11 \",\"pages\":\"Article 100133\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-02-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Natural Language Processing Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2949719125000093\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719125000093","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: Online hate speech has become a major social issue in recent years, affecting both individuals and society as a whole. Memes are an increasingly common multimodal vehicle for such speech. Online memes are often entertaining and harmless; a seemingly innocent meme, however, turns into a multimodal form of hate speech (a hateful meme) when particular kinds of text, imagery, or combinations of the two are used. The spread of such harmful or undesirable memes can disrupt societal peace, so it is vital to limit inappropriate memes on social media. Multimodal hate speech identification is an inherently difficult and open problem: it requires joint language understanding, visual perception, and multimodal reasoning. This work advances that line of research by building a multi-task learning (MTL) based multimodal system for detecting hateful memes, trained on four hateful-meme datasets concurrently. The MTL framework, which combines Contrastive Language-Image Pretraining (CLIP), UNiversal Image-TExt Representation learning (UNITER), and BERT, is trained jointly so that common knowledge is transferred across the four meme datasets. The results show that the proposed strategy outperforms unimodal and multimodal baselines on four multilingual benchmark datasets in AUC-ROC, accuracy, and F1-score. Ablation studies highlight the impact of each sub-component of the MTL model, and confusion matrices are presented as quantitative analysis.
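To make the described architecture concrete, below is a minimal PyTorch sketch of a hard-parameter-sharing multi-task classifier in the spirit of the abstract's framework. The CLIP and BERT calls use the Hugging Face `transformers` API; UNITER has no `transformers` port, so a linear stand-in marks where its joint image-text embedding would enter. All dimensions, model checkpoints, the shared-trunk design, and the per-dataset heads are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: hard-parameter-sharing MTL for hateful-meme classification.
# Assumptions: CLIP (openai/clip-vit-base-patch32) and BERT (bert-base-uncased)
# as feature extractors; a linear layer standing in for UNITER's fused features;
# one binary head per meme dataset (hateful vs. not hateful).
import torch
import torch.nn as nn
from transformers import CLIPModel, BertModel

class MultiTaskMemeClassifier(nn.Module):
    def __init__(self, num_tasks: int = 4, hidden: int = 512):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Placeholder for a UNITER-style joint image-text embedding.
        self.uniter_proj = nn.Linear(768, 256)
        # CLIP image (512) + CLIP text (512) + BERT [CLS] (768) + stand-in (256)
        fused_dim = 512 + 512 + 768 + 256
        # Shared trunk: where cross-dataset knowledge transfer happens.
        self.shared = nn.Sequential(nn.Linear(fused_dim, hidden), nn.ReLU())
        # One task-specific classification head per meme dataset.
        self.heads = nn.ModuleList([nn.Linear(hidden, 2) for _ in range(num_tasks)])

    def forward(self, pixel_values, clip_ids, clip_mask, bert_ids, bert_mask, task: int):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=clip_ids, attention_mask=clip_mask)
        cls = self.bert(input_ids=bert_ids, attention_mask=bert_mask).pooler_output
        joint = self.uniter_proj(cls)  # stand-in for UNITER's fused representation
        fused = torch.cat([img, txt, cls, joint], dim=-1)
        return self.heads[task](self.shared(fused))
```

During training, mini-batches from the four datasets would be interleaved, with each batch's cross-entropy loss backpropagated through the shared trunk and only its own head, which is the standard hard-parameter-sharing MTL recipe for transferring common knowledge across tasks.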