Multimodal named entity recognition in the era of large pre-trained models: A comprehensive survey

Impact factor: 15.5 · CAS Region 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Information Fusion · Publication date: 2025-09-19 · DOI: 10.1016/j.inffus.2025.103767

Mingying Xu, Fei Hou, Jie Liu, Mengmei Zhang, Lei Shi, Feifei Kou, Lei Guo, Philip S. Yu, Xuming Hu

Volume 127, Article 103767. Publisher page: https://www.sciencedirect.com/science/article/pii/S1566253525008292

Citations: 0

Abstract

The rapid development of social media platforms such as Twitter and Facebook has made tweets an essential resource for applications including collecting breaking news, identifying cyber-attacks, and detecting disease outbreaks. As social media becomes increasingly multimodal, Multimodal Named Entity Recognition (MNER) for social media has been widely studied as a way to extract valuable information from tweets and improve the understanding of tweet content. In recent years, the development of Large Pre-trained Models (LPMs) has brought revolutionary changes to many research fields, continuously pushing the performance boundaries of various tasks, especially in social media, where tweets exist on a massive scale and involve multiple media types. First, LPMs can effectively handle semantic sparsity and capture richer visual features, providing a more accurate understanding of tweets containing ambiguous or sparse terms. Second, LPMs can provide background knowledge, enhancing the expressiveness of semantically sparse tweets. Third, LPMs are capable of cross-modal semantic alignment, which lets them integrate and optimize the semantic information from both text and images, effectively fusing multimodal features and reducing noise. Despite these advantages, LPMs still face specific challenges when handling complex tasks: when clear supporting evidence is lacking, they can generate erroneous “hallucinated” content. This is primarily due to insufficient contextual support during the fusion of multimodal information, which leads to inaccurate reasoning and reduces the model’s reliability. By integrating multimodal information from both text and images, MNER can provide LPMs with more factual grounding and contextual support, reducing the likelihood of hallucinated content and enhancing the reasoning ability and accuracy of LPMs. This survey is therefore the first to systematically review the research progress of LPMs in MNER from the perspectives of multimodal representation, multimodal alignment, and multimodal fusion, and to explore the application of LPMs in MNER. Finally, it summarizes the main challenges that MNER faces and provides an outlook on future development directions.


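Source journal

Information Fusion (Engineering & Technology; Computer Science: Theory & Methods)

CiteScore: 33.20

Self-citation rate: 4.30%

Articles published: 161

Review time: 7.9 months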

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.