Mingying Xu, Fei Hou, Jie Liu, Mengmei Zhang, Lei Shi, Feifei Kou, Lei Guo, Philip S. Yu, Xuming Hu
Title: Multimodal named entity recognition in the era of large pre-trained models: A comprehensive survey
Journal: Information Fusion, volume 127, Article 103767
Publication date: 2025-09-19
DOI: 10.1016/j.inffus.2025.103767
URL: https://www.sciencedirect.com/science/article/pii/S1566253525008292
Impact factor: 15.5; JCR: Q1 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
The rapid growth of social media platforms such as Twitter and Facebook has made posts such as tweets an essential resource for applications including collecting breaking news, identifying cyber-attacks, and detecting disease outbreaks. As social media becomes increasingly multimodal, Multimodal Named Entity Recognition (MNER) for social media has been widely studied as a way to extract valuable information from tweets and improve the understanding of tweet content. In recent years, Large Pre-trained Models (LPMs) have transformed many research fields and continue to push the performance boundaries of various tasks, especially on social media, where tweets exist on a massive scale and involve multiple media types. First, LPMs can effectively handle semantic sparsity and capture richer visual features, providing a more accurate understanding of tweets that contain ambiguous or sparse terms. Second, LPMs can supply background knowledge, enhancing the expressiveness of semantically sparse tweets. Third, LPMs are capable of cross-modal semantic alignment, which enables them to integrate and optimize the semantic information from both text and images, effectively fusing multimodal features and reducing noise. Despite these advantages, LPMs still face specific challenges when handling complex tasks: when clear supporting evidence is lacking, they tend to generate erroneous “hallucinated” content. This stems primarily from insufficient contextual support when LPMs fuse multimodal information, which leads to inaccurate reasoning and reduces model reliability. By integrating multimodal information from both text and images, MNER can provide LPMs with more factual grounding and contextual support, reducing the likelihood of hallucinated content and improving the reasoning ability and accuracy of LPMs. This survey is therefore the first to systematically review the research progress of LPMs in MNER from the perspectives of multimodal representation, multimodal alignment, and multimodal fusion, and it explores the application of LPMs in MNER. Finally, it summarizes the main challenges that MNER faces and provides an outlook on future development directions.
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, and multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.