Title: Few-Shot In-Context Learning for Implicit Semantic Multimodal Content Detection and Interpretation
Authors: Xiuxian Wang; Lanjun Wang; Yuting Su; Hongshuo Tian; Guoqing Jin; An-An Liu
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9545-9558
DOI: 10.1109/TCSVT.2025.3550900
Published: 2025-03-13
URL: https://ieeexplore.ieee.org/document/10925399/
Cited by: 0
Abstract
In recent years, the field of explicit semantic multimodal content research has made significant progress. However, research on content with implicit semantics, such as online memes, remains insufficient. Memes often convey implicit semantics through metaphors and may sometimes contain hateful information. To address this issue, researchers have proposed the task of hateful meme detection, opening up new avenues for exploring implicit semantics. Hateful meme detection currently faces two main problems: 1) the rapid emergence of new meme content makes continuous tracking and detection difficult; 2) current methods often lack interpretability, which limits understanding of and trust in the detection results. To better understand memes, we analyze the definition of metaphor from social science and identify three key factors of metaphor: socio-cultural knowledge, metaphorical tenor, and metaphorical representation pattern. Guided by these key factors, we lead a multimodal large language model (MLLM) to infer the metaphors expressed in memes step by step. Specifically, we propose a hateful meme detection and interpretation framework consisting of four modules. We first leverage a multimodal generative search method to obtain socio-cultural knowledge relevant to the visual objects in memes. Then, we use this socio-cultural knowledge to instruct the MLLM to assess socio-cultural relevance scores between visual objects and textual information, identifying the metaphorical tenor of each meme. Meanwhile, we apply a representative interpretation method that provides representative meme cases and analyzes them to uncover the metaphorical representation pattern. Finally, a chain-of-thought prompt is constructed to integrate the outputs of the above modules, guiding the MLLM to accurately detect and interpret hateful memes. Our method achieves state-of-the-art performance on three hateful meme detection benchmarks and outperforms supervised models on a hateful meme interpretation benchmark.
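To make the four-module pipeline concrete, the sketch below wires the described stages (knowledge retrieval, tenor identification, representative cases, chain-of-thought prompting) into one detection flow. This is a minimal, hypothetical illustration: the function names, the `Meme` fields, the injected `query_mllm` interface, and the prompt wording are all assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the four-module hateful meme pipeline described
# in the abstract. All names and prompt text are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Meme:
    image_path: str        # path to the meme image
    text: str              # overlaid or accompanying text
    visual_objects: list   # detected visual objects, e.g. ["flag", "crowd"]


def retrieve_knowledge(objects: list) -> dict:
    """Module 1 (assumed): multimodal generative search that returns
    socio-cultural knowledge snippets keyed by visual object."""
    return {obj: f"(retrieved knowledge about {obj})" for obj in objects}


def identify_tenor(meme: Meme, knowledge: dict, query_mllm) -> str:
    """Module 2 (assumed): ask the MLLM to score the socio-cultural
    relevance of each visual object to the text, then name the tenor."""
    prompt = (
        "Rate the socio-cultural relevance (0-10) of each visual object "
        "to the meme text, and name the metaphorical tenor.\n"
        f"Text: {meme.text}\nKnowledge: {knowledge}"
    )
    return query_mllm(prompt, image=meme.image_path)


def representative_cases(meme: Meme, case_bank: list) -> list:
    """Module 3 (assumed): retrieve representative memes whose analyses
    expose the metaphorical representation pattern."""
    return case_bank[:3]  # placeholder for a real top-k retrieval


def detect_and_interpret(meme: Meme, query_mllm, case_bank: list) -> str:
    """Module 4 (assumed): assemble a chain-of-thought prompt from the
    outputs of modules 1-3 and query the MLLM for the final verdict."""
    knowledge = retrieve_knowledge(meme.visual_objects)
    tenor = identify_tenor(meme, knowledge, query_mllm)
    cases = representative_cases(meme, case_bank)
    cot_prompt = (
        "Step 1 - socio-cultural knowledge: " + str(knowledge) + "\n"
        "Step 2 - metaphorical tenor: " + tenor + "\n"
        "Step 3 - representative cases and their patterns: " + str(cases) + "\n"
        "Step 4 - using the above, decide whether this meme is hateful "
        "and explain the metaphor it relies on.\nText: " + meme.text
    )
    return query_mllm(cot_prompt, image=meme.image_path)


# Usage with a stub MLLM (for illustration only):
# meme = Meme("meme.png", "some caption", ["flag", "crowd"])
# verdict = detect_and_interpret(
#     meme, query_mllm=lambda p, image=None: "(MLLM output)", case_bank=[])
```

The key design point the abstract emphasizes is that the MLLM is not asked for a one-shot verdict: each module's output becomes an explicit reasoning step in the final chain-of-thought prompt, which is what makes the detection result interpretable.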
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.