{"title":"模因多任务分类的视觉语言模型。","authors":"Md. Mithun Hossain , Md. Shakil Hossain , M.F. Mridha , Nilanjan Dey","doi":"10.1016/j.neunet.2025.108089","DOIUrl":null,"url":null,"abstract":"<div><div>The emergence of social media and online memes has led to an increasing demand for automated systems that can analyse and classify multimodal data, particularly in online forums. Memes blend text and graphics to express complicated ideas, sometimes containing emotions, satire, or inappropriate material. Memes often represent cultural prejudices such as objectification, sexism, and bigotry, making it difficult for artificial intelligence to classify these components. Our solution is the vision-language model ViT-BERT CAMT (cross-attention multitask), which is intended for multitask meme categorization. Our model uses a linear self-attentive fusion mechanism to combine vision transformer (ViT) features for image analysis and bidirectional encoder representations from transformers (BERT) for text interpretation. In this way, we can see how text and images relate to space and meaning. We tested the ViT-BERT CAMT on two difficult datasets: the SemEval 2020 Memotion dataset, which contains a multilabel classification of sentiment, sarcasm, and offensiveness in memes, and the MIMIC dataset, which focuses on detecting sexism, objectification, and prejudice. The findings show that the ViT-BERT CAMT achieves good accuracy on both datasets and outperforms many current baselines in multitask settings. These results highlight the importance of combined image-text modelling for correctly deciphering nuanced meanings in memes, particularly when spotting abusive and discriminatory content. By improving multimodal categorization algorithms, this study helps better monitor and comprehend online conversation.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"194 ","pages":"Article 108089"},"PeriodicalIF":6.3000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A vision-language model for multitask classification of memes\",\"authors\":\"Md. Mithun Hossain , Md. Shakil Hossain , M.F. Mridha , Nilanjan Dey\",\"doi\":\"10.1016/j.neunet.2025.108089\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The emergence of social media and online memes has led to an increasing demand for automated systems that can analyse and classify multimodal data, particularly in online forums. Memes blend text and graphics to express complicated ideas, sometimes containing emotions, satire, or inappropriate material. Memes often represent cultural prejudices such as objectification, sexism, and bigotry, making it difficult for artificial intelligence to classify these components. Our solution is the vision-language model ViT-BERT CAMT (cross-attention multitask), which is intended for multitask meme categorization. Our model uses a linear self-attentive fusion mechanism to combine vision transformer (ViT) features for image analysis and bidirectional encoder representations from transformers (BERT) for text interpretation. In this way, we can see how text and images relate to space and meaning. We tested the ViT-BERT CAMT on two difficult datasets: the SemEval 2020 Memotion dataset, which contains a multilabel classification of sentiment, sarcasm, and offensiveness in memes, and the MIMIC dataset, which focuses on detecting sexism, objectification, and prejudice. 
The findings show that the ViT-BERT CAMT achieves good accuracy on both datasets and outperforms many current baselines in multitask settings. These results highlight the importance of combined image-text modelling for correctly deciphering nuanced meanings in memes, particularly when spotting abusive and discriminatory content. By improving multimodal categorization algorithms, this study helps better monitor and comprehend online conversation.</div></div>\",\"PeriodicalId\":49763,\"journal\":{\"name\":\"Neural Networks\",\"volume\":\"194 \",\"pages\":\"Article 108089\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2025-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neural Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0893608025009694\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025009694","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A vision-language model for multitask classification of memes
The rise of social media and online memes has created a growing demand for automated systems that can analyse and classify multimodal content, particularly in online forums. Memes combine text and imagery to express complex ideas and frequently carry emotion, satire, or inappropriate material. They often encode cultural prejudices such as objectification, sexism, and bigotry, which makes them difficult for artificial intelligence systems to classify. We propose ViT-BERT CAMT (cross-attention multitask), a vision-language model designed for multitask meme categorization. The model uses a linear self-attentive fusion mechanism to combine vision transformer (ViT) features for image analysis with bidirectional encoder representations from transformers (BERT) for text interpretation, allowing it to capture both the spatial and semantic relationships between text and images. We evaluated ViT-BERT CAMT on two challenging datasets: the SemEval 2020 Memotion dataset, a multilabel classification of sentiment, sarcasm, and offensiveness in memes, and the MIMIC dataset, which targets the detection of sexism, objectification, and prejudice. The results show that ViT-BERT CAMT achieves strong accuracy on both datasets and outperforms many current baselines in multitask settings. These findings underline the importance of joint image-text modelling for correctly deciphering nuanced meanings in memes, particularly when identifying abusive and discriminatory content. By improving multimodal categorization algorithms, this study supports better monitoring and understanding of online discourse.
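The abstract does not include an implementation, but the architecture it describes (ViT and BERT encoders joined by cross-attention, with a self-attentive fusion feeding several task heads) can be sketched concisely. Below is a minimal PyTorch sketch under stated assumptions: the pretrained checkpoints, the bidirectional cross-attention wiring, the exact form of the "linear self-attentive fusion", and the three task heads (sentiment, sarcasm, offensiveness, as in Memotion) are illustrative guesses, not the authors' code.

```python
# Minimal sketch of a ViT + BERT cross-attention multitask classifier,
# reconstructed from the abstract only; all names and dimensions are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

class ViTBertCAMT(nn.Module):
    def __init__(self, hidden=768, num_heads=8, tasks=None):
        super().__init__()
        # Assumed task heads: Memotion-style sentiment/sarcasm/offensiveness.
        tasks = tasks or {"sentiment": 3, "sarcasm": 2, "offensiveness": 2}
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Cross-attention in both directions: text queries attend to image
        # patches, and image queries attend to text tokens.
        self.txt2img = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        # Assumed form of the "linear self-attentive fusion": a linear layer
        # scores each token, and the softmaxed scores pool the sequence.
        self.attn_score = nn.Linear(hidden, 1)
        self.task_heads = nn.ModuleDict(
            {task: nn.Linear(2 * hidden, n) for task, n in tasks.items()}
        )

    def _pool(self, seq):
        # Self-attentive pooling over the sequence dimension.
        w = torch.softmax(self.attn_score(seq), dim=1)   # (B, L, 1)
        return (w * seq).sum(dim=1)                      # (B, H)

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.vit(pixel_values=pixel_values).last_hidden_state       # (B, P, H)
        txt = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state  # (B, T, H)
        txt_ctx, _ = self.txt2img(query=txt, key=img, value=img)          # (B, T, H)
        img_ctx, _ = self.img2txt(query=img, key=txt, value=txt)          # (B, P, H)
        fused = torch.cat([self._pool(txt_ctx), self._pool(img_ctx)], -1) # (B, 2H)
        return {task: head(fused) for task, head in self.task_heads.items()}
```

A multitask training step would then typically sum one loss per head, e.g. `loss = sum(F.cross_entropy(out[t], labels[t]) for t in out)`, so the shared encoders and fusion layers receive gradients from all tasks at once.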
Journal Introduction:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.