MISTRA: Misogyny Detection through Text–Image Fusion and Representation Analysis

Nitesh Jindal, Prasanna Kumar Kumaresan, Rahul Ponnusamy, Sajeetha Thavareesan, Saranya Rajiakodi, Bharathi Raja Chakravarthi
{"title":"MISTRA: Misogyny Detection through Text–Image Fusion and Representation Analysis","authors":"Nitesh Jindal ,&nbsp;Prasanna Kumar Kumaresan ,&nbsp;Rahul Ponnusamy ,&nbsp;Sajeetha Thavareesan ,&nbsp;Saranya Rajiakodi ,&nbsp;Bharathi Raja Chakravarthi","doi":"10.1016/j.nlp.2024.100073","DOIUrl":null,"url":null,"abstract":"<div><p>Detecting misogynous memes poses a significant challenge due to the presence of multiple modalities (image + text). The inherent complexity arises from the lack of direct correspondence between the textual and visual elements, where an image and overlaid text often convey disparate meanings. Additionally, memes conveying messages of hatred or taunting, particularly targeted towards women, present additional comprehension difficulties. This article introduces the MISTRA framework, which leverages variational autoencoders for dimensionality reduction of the large-sized image features before fusing multimodal features. The framework also harnesses the capabilities of large language models through transfer learning to develop fusion embeddings by extracting and concatenating features from different modalities (image, text, and image-generated caption text) for the misogynous classification task. The components of the framework include state-of-the-art models such as the Vision Transformer model (ViT), textual model (DistilBERT), CLIP (Contrastive Language–Image Pre-training), and BLIP (Bootstrapping Language–Image Pre-training for Unified Vision-Language Understanding and Generation) models. Our experiments are conducted on the SemEval-2022 Task 5 MAMI dataset. To establish a baseline model, we perform separate experiments using the Naive Bayes machine learning classifier on meme texts and ViT on meme images. We evaluate the performance on six different bootstrap samples and report evaluation metrics such as precision, recall, and Macro-F1 score for each bootstrap sample. Additionally, we compute the confidence interval on our evaluation scores and conduct paired t-tests to understand whether our best-performing model has significant differences from the other experiments or not. The experimental results demonstrate that the dimensionality reduction approach on multimodal features with a multilayer perceptron classifier achieved the highest performance with a Macro–F1 score of 71.5 percent, outperforming the baseline approaches in individual modalities.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"7 ","pages":"Article 100073"},"PeriodicalIF":0.0000,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719124000219/pdfft?md5=b1becf6173b99dae8a0f29ea4d466646&pid=1-s2.0-S2949719124000219-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719124000219","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Detecting misogynous memes poses a significant challenge due to the presence of multiple modalities (image + text). The inherent complexity arises from the lack of direct correspondence between the textual and visual elements: an image and its overlaid text often convey disparate meanings. Additionally, memes conveying messages of hatred or taunting, particularly those targeted at women, present further comprehension difficulties. This article introduces the MISTRA framework, which applies variational autoencoders to reduce the dimensionality of the large image feature vectors before fusing multimodal features. The framework also harnesses the capabilities of large language models through transfer learning, building fusion embeddings by extracting and concatenating features from different modalities (image, text, and image-generated caption text) for the misogyny classification task. Its components include state-of-the-art models such as the Vision Transformer (ViT), a textual model (DistilBERT), CLIP (Contrastive Language–Image Pre-training), and BLIP (Bootstrapping Language–Image Pre-training for Unified Vision-Language Understanding and Generation). Our experiments are conducted on the SemEval-2022 Task 5 MAMI dataset. To establish baselines, we perform separate experiments using a Naive Bayes classifier on meme texts and ViT on meme images. We evaluate performance on six different bootstrap samples, reporting precision, recall, and Macro-F1 score for each sample. Additionally, we compute confidence intervals on the evaluation scores and conduct paired t-tests to determine whether the best-performing model differs significantly from the other experiments. The experimental results demonstrate that the dimensionality reduction approach on multimodal features with a multilayer perceptron classifier achieved the highest performance, with a Macro-F1 score of 71.5%, outperforming the baseline approaches on individual modalities.
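To make the described pipeline concrete, below is a minimal PyTorch sketch of the fusion step: ViT image features are compressed with a small variational autoencoder, concatenated with DistilBERT text features, and fed to an MLP classifier. The checkpoints, feature dimensions, latent size, and MLP layout are illustrative assumptions rather than the authors' exact configuration; CLIP and BLIP caption features would be extracted and concatenated the same way, and VAE training (reconstruction + KL loss) is omitted.

```python
# Hedged sketch of the MISTRA-style fusion pipeline, not the authors' code.
import torch
import torch.nn as nn
from transformers import (ViTModel, ViTImageProcessor,
                          DistilBertModel, DistilBertTokenizerFast)
from PIL import Image

class VAE(nn.Module):
    """Compresses a 768-d ViT feature vector to a low-dimensional latent code."""
    def __init__(self, in_dim=768, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
vae = VAE()  # shown untrained; in practice trained on the ViT features

image = Image.open("meme.jpg").convert("RGB")   # hypothetical meme image
text = "overlaid meme text"                     # its extracted caption text

with torch.no_grad():
    img_feat = vit(**processor(image, return_tensors="pt")).pooler_output
    txt_feat = bert(**tokenizer(text, return_tensors="pt")
                    ).last_hidden_state[:, 0]   # [CLS]-position embedding
    _, z_img, _ = vae(img_feat)                 # latent mean = reduced image
                                                # representation
fused = torch.cat([z_img, txt_feat], dim=-1)    # fusion embedding (64 + 768)

# MLP classifier over the fused embedding (binary: misogynous / not)
mlp = nn.Sequential(nn.Linear(fused.size(-1), 128), nn.ReLU(),
                    nn.Linear(128, 2))
logits = mlp(fused)
```

Reducing the image features before concatenation keeps the image modality from dominating the fused vector purely by dimensionality, which is the motivation the abstract gives for the VAE step.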

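The evaluation protocol (six bootstrap samples, Macro-F1 with a confidence interval, and a paired t-test between systems) can likewise be sketched. The resample count follows the abstract; the labels and predictions below are random placeholders, and the exact resampling scheme is an assumption.

```python
# Hedged sketch of bootstrap evaluation with significance testing.
import numpy as np
from sklearn.metrics import f1_score
from scipy import stats

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # placeholder gold labels
pred_a = rng.integers(0, 2, size=1000)   # system A predictions (placeholder)
pred_b = rng.integers(0, 2, size=1000)   # system B predictions (placeholder)

scores_a, scores_b = [], []
for _ in range(6):                        # six bootstrap samples, per abstract
    idx = rng.integers(0, len(y_true), size=len(y_true))  # with replacement
    scores_a.append(f1_score(y_true[idx], pred_a[idx], average="macro"))
    scores_b.append(f1_score(y_true[idx], pred_b[idx], average="macro"))

mean_a = np.mean(scores_a)
# 95% confidence interval via the t-distribution over the bootstrap scores
ci = stats.t.interval(0.95, df=len(scores_a) - 1,
                      loc=mean_a, scale=stats.sem(scores_a))
t_stat, p_val = stats.ttest_rel(scores_a, scores_b)  # paired t-test
print(f"Macro-F1 A: {mean_a:.3f}, 95% CI: {ci}, paired t-test p={p_val:.3f}")
```

Scoring both systems on the same bootstrap indices is what makes the t-test paired: each resample yields one matched pair of Macro-F1 scores.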