{"title":"MISTRA: Misogyny Detection through Text–Image Fusion and Representation Analysis","authors":"Nitesh Jindal , Prasanna Kumar Kumaresan , Rahul Ponnusamy , Sajeetha Thavareesan , Saranya Rajiakodi , Bharathi Raja Chakravarthi","doi":"10.1016/j.nlp.2024.100073","DOIUrl":null,"url":null,"abstract":"<div><p>Detecting misogynous memes poses a significant challenge due to the presence of multiple modalities (image + text). The inherent complexity arises from the lack of direct correspondence between the textual and visual elements, where an image and overlaid text often convey disparate meanings. Additionally, memes conveying messages of hatred or taunting, particularly targeted towards women, present additional comprehension difficulties. This article introduces the MISTRA framework, which leverages variational autoencoders for dimensionality reduction of the large-sized image features before fusing multimodal features. The framework also harnesses the capabilities of large language models through transfer learning to develop fusion embeddings by extracting and concatenating features from different modalities (image, text, and image-generated caption text) for the misogynous classification task. The components of the framework include state-of-the-art models such as the Vision Transformer model (ViT), textual model (DistilBERT), CLIP (Contrastive Language–Image Pre-training), and BLIP (Bootstrapping Language–Image Pre-training for Unified Vision-Language Understanding and Generation) models. Our experiments are conducted on the SemEval-2022 Task 5 MAMI dataset. To establish a baseline model, we perform separate experiments using the Naive Bayes machine learning classifier on meme texts and ViT on meme images. We evaluate the performance on six different bootstrap samples and report evaluation metrics such as precision, recall, and Macro-F1 score for each bootstrap sample. Additionally, we compute the confidence interval on our evaluation scores and conduct paired t-tests to understand whether our best-performing model has significant differences from the other experiments or not. The experimental results demonstrate that the dimensionality reduction approach on multimodal features with a multilayer perceptron classifier achieved the highest performance with a Macro–F1 score of 71.5 percent, outperforming the baseline approaches in individual modalities.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"7 ","pages":"Article 100073"},"PeriodicalIF":0.0000,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719124000219/pdfft?md5=b1becf6173b99dae8a0f29ea4d466646&pid=1-s2.0-S2949719124000219-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719124000219","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Detecting misogynous memes poses a significant challenge because of their multimodal nature (image + text). The inherent complexity arises from the lack of direct correspondence between the textual and visual elements: an image and its overlaid text often convey disparate meanings. Memes that convey hateful or taunting messages, particularly those targeted at women, are harder still to interpret. This article introduces the MISTRA framework, which applies variational autoencoders to reduce the dimensionality of the high-dimensional image features before fusing the multimodal features. The framework also harnesses the capabilities of large language models through transfer learning, building fusion embeddings by extracting and concatenating features from different modalities (image, text, and image-generated caption text) for the misogyny classification task. The components of the framework include state-of-the-art models such as the Vision Transformer (ViT), DistilBERT, CLIP (Contrastive Language–Image Pre-training), and BLIP (Bootstrapping Language–Image Pre-training for Unified Vision-Language Understanding and Generation). Our experiments are conducted on the SemEval-2022 Task 5 MAMI dataset. To establish baselines, we run separate experiments using a Naive Bayes classifier on meme texts and ViT on meme images. We evaluate performance on six different bootstrap samples and report precision, recall, and Macro-F1 score for each sample. Additionally, we compute confidence intervals on the evaluation scores and conduct paired t-tests to determine whether the best-performing model differs significantly from the other models. The experimental results demonstrate that dimensionality reduction of the multimodal features combined with a multilayer perceptron classifier achieves the highest performance, with a Macro-F1 score of 71.5 percent, outperforming the single-modality baselines.
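To make the described pipeline concrete, below is a minimal sketch of how such fusion embeddings could be assembled from off-the-shelf Hugging Face checkpoints. The checkpoint names, latent dimension, CLS-token pooling, and helper names (VAE, fuse_features) are illustrative assumptions, not the authors' implementation; CLIP image/text features would be concatenated in the same way and are omitted here for brevity.

```python
# Hypothetical sketch of MISTRA-style multimodal feature fusion.
# Checkpoints and dimensions are assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from PIL import Image
from transformers import (ViTModel, ViTImageProcessor,
                          DistilBertModel, DistilBertTokenizerFast,
                          BlipProcessor, BlipForConditionalGeneration)

class VAE(nn.Module):
    """Small VAE that compresses high-dimensional image features."""
    def __init__(self, in_dim=768, latent_dim=64):
        super().__init__()
        self.enc = nn.Linear(in_dim, 256)
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def encode(self, x):
        h = torch.relu(self.enc(x))
        return self.mu(h), self.logvar(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterization trick
        return self.dec(z), mu, logvar

vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")

@torch.no_grad()
def fuse_features(image, meme_text, vae):
    # Image features: ViT CLS token (768-d), compressed by the VAE encoder.
    img_in = vit_proc(images=image, return_tensors="pt")
    img_feat = vit(**img_in).last_hidden_state[:, 0]            # (1, 768)
    img_mu, _ = vae.encode(img_feat)                            # (1, 64)
    # Text features: DistilBERT CLS token over the overlaid meme text.
    txt_in = tok(meme_text, return_tensors="pt", truncation=True)
    txt_feat = bert(**txt_in).last_hidden_state[:, 0]           # (1, 768)
    # Caption features: BLIP generates a caption, DistilBERT encodes it.
    cap_ids = blip.generate(**blip_proc(images=image, return_tensors="pt"))
    caption = blip_proc.decode(cap_ids[0], skip_special_tokens=True)
    cap_in = tok(caption, return_tensors="pt", truncation=True)
    cap_feat = bert(**cap_in).last_hidden_state[:, 0]           # (1, 768)
    # Fusion embedding = concatenation of all modality features.
    return torch.cat([img_mu, txt_feat, cap_feat], dim=-1)      # (1, 1600)

# Usage (hypothetical; the VAE would be trained beforehand with a
# reconstruction + KL objective, and the 1600-d embedding fed to an MLP):
# emb = fuse_features(Image.open("meme.png"), "overlaid meme text", VAE())
```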
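The evaluation protocol (bootstrap resampling, confidence intervals, and paired t-tests) can likewise be sketched as follows; the toy labels, predictions, and the 95% confidence level are assumptions for demonstration only.

```python
# Illustrative bootstrap evaluation with Macro-F1, a confidence interval,
# and a paired t-test between a fusion model and a baseline.
import numpy as np
from scipy import stats
from sklearn.metrics import precision_recall_fscore_support

def bootstrap_scores(y_true, y_pred, n_samples=6, seed=0):
    """Macro precision/recall/F1 over bootstrap resamples of the test set."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_samples):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # with replacement
        p, r, f1, _ = precision_recall_fscore_support(
            y_true[idx], y_pred[idx], average="macro", zero_division=0)
        scores.append((p, r, f1))
    return np.array(scores)

# Toy labels/predictions standing in for real model outputs (assumption).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)
fusion_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)    # ~80% acc.
baseline_pred = np.where(rng.random(500) < 0.7, y_true, 1 - y_true)  # ~70% acc.

fusion_f1 = bootstrap_scores(y_true, fusion_pred)[:, 2]
baseline_f1 = bootstrap_scores(y_true, baseline_pred)[:, 2]

# 95% confidence interval on the fusion model's mean Macro-F1 (t-distribution).
mean, sem = fusion_f1.mean(), stats.sem(fusion_f1)
ci_low, ci_high = stats.t.interval(0.95, len(fusion_f1) - 1, loc=mean, scale=sem)

# Paired t-test: do the two models differ significantly across the same samples?
t_stat, p_value = stats.ttest_rel(fusion_f1, baseline_f1)
print(f"Macro-F1 {mean:.3f} (95% CI [{ci_low:.3f}, {ci_high:.3f}]), p = {p_value:.4f}")
```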