Multimodal named entity recognition in the era of large pre-trained models: A comprehensive survey

IF 15.5 | Zone 1 (Computer Science) | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Mingying Xu, Fei Hou, Jie Liu, Mengmei Zhang, Lei Shi, Feifei Kou, Lei Guo, Philip S. Yu, Xuming Hu
DOI: 10.1016/j.inffus.2025.103767
Journal: Information Fusion, volume 127, Article 103767
Publication date: 2025-09-19
Publication type: Journal Article
Source: https://www.sciencedirect.com/science/article/pii/S1566253525008292
Citations: 0

Abstract

The rapid development of social media platforms such as Twitter and Facebook has made tweets an essential resource for applications including collecting breaking news, identifying cyber-attacks, and detecting disease outbreaks. As social media becomes increasingly multimodal, Multimodal Named Entity Recognition (MNER) for social media has been widely studied as a way to extract valuable information from tweets and improve the understanding of tweet content. In recent years, Large Pre-trained Models (LPMs) have brought revolutionary changes to many research fields and continue to push the performance boundaries of various tasks, especially on social media, where tweets exist on a massive scale and involve multiple media types. First, LPMs can effectively handle semantic sparsity and capture richer visual features, providing a more accurate understanding of tweets containing ambiguous or sparse terms. Second, LPMs can provide background knowledge, enhancing the expressiveness of semantically sparse tweets. Third, LPMs possess the ability for cross-modal semantic alignment, which enables them to integrate and optimize the semantic information from both text and images, effectively fusing multimodal features and reducing noise. Despite these significant advantages, LPMs still face specific challenges when handling complex tasks: when clear supporting evidence is lacking, they can generate erroneous “hallucinated” content. This is primarily due to insufficient contextual support during the fusion of multimodal information, which leads to inaccurate reasoning and reduces the model’s reliability. By integrating multimodal information from both text and images, MNER can provide LPMs with more factual grounding and contextual support, reducing the likelihood of hallucinated content and enhancing the reasoning ability and accuracy of LPMs. This survey is therefore the first to systematically review the research progress of LPMs in MNER from the perspectives of multimodal representation, multimodal alignment, and multimodal fusion, and it explores the application of LPMs in MNER. Finally, it summarizes the main challenges that MNER faces and provides an outlook on future development directions.
Source journal
Information Fusion (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 33.20
Self-citation rate: 4.30%
Articles published: 161
Review time: 7.9 months
Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.