Multimodal Named Entity Recognition based on topic prompt and multi-curriculum denoising

IF 14.7 · Zone 1, Computer Science · JCR Q1, Computer Science, Artificial Intelligence
Mingying Xu, Kui Peng, Jie Liu, Qing Zhang, Linqi Song, Yinqiao Li
Information Fusion, vol. 124, Article 103405. Published 2025-06-21. DOI: 10.1016/j.inffus.2025.103405
https://www.sciencedirect.com/science/article/pii/S1566253525004786
Citations: 0

Abstract

The rapid development of Generative Large Models (GLMs) such as ChatGPT and GPT4 has significantly enhanced their ability to handle complex tasks and drive innovation across multiple fields, especially in social media. However, GLMs are prone to generating “hallucinated” content when dealing with ambiguous problems that lack clear evidence, which undermines their reliability. Multimodal Named Entity Recognition (MNER) addresses this issue by integrating image, text, and contextual information to establish a fact-based framework, thereby reducing the risk of hallucination and strengthening the reasoning foundation of GLMs. Combining GLMs with MNER merges the flexibility of content generation with evidence-based constraints, improving both reliability and interpretability. In the MNER task, however, weakly related or irrelevant image information introduces noise that degrades performance. In this paper, we propose a novel framework, TPMCLNet, which combines topic prompts with a multi-curriculum denoising strategy. First, the topic prompt module extracts topic information from the images and integrates this image-derived information with the text as auxiliary input, thereby enhancing the model’s understanding of multimodal data. This is particularly useful when the correlation between image and text is weak, because it provides additional semantic cues that help the model identify named entities more accurately. Second, we employ a denoising strategy based on multi-curriculum learning, which defines noise metrics at different granularities to progressively optimize the presentation order of the training data, reducing the impact of noise on the model. Within this framework, we conduct a comprehensive noise assessment of both images and text, gradually introducing cleaner data to improve model training. Experimental results show that, by combining topic prompts with the multi-curriculum denoising strategy, TPMCLNet significantly improves MNER performance in complex multimodal environments, demonstrating its effectiveness.
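
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the TPMCLNet implementation, which the abstract does not specify. Everything here is a hypothetical stand-in: the `Sample` fields, the `[SEP]` separator, the equal weighting of text and image noise, and the choice to present the least noisy samples first (a common curriculum-learning convention). The sketch only shows the general shape: image-derived topic words are appended to the text as auxiliary input, and training subsets grow in an order determined by a noise score.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class Sample:
    text: str                 # the post's text
    image_topics: List[str]   # topic words assumed to come from an image model
    text_noise: float         # hypothetical text-level noise score in [0, 1]
    image_noise: float        # hypothetical image-level noise score in [0, 1]


def build_topic_prompt(sample: Sample, sep: str = " [SEP] ") -> str:
    """Append image-derived topic words to the text as auxiliary input."""
    if not sample.image_topics:
        return sample.text
    return sample.text + sep + "image topics: " + ", ".join(sample.image_topics)


def noise_score(sample: Sample, w_text: float = 0.5, w_image: float = 0.5) -> float:
    """Combine per-modality noise scores into one value (weights are assumed)."""
    return w_text * sample.text_noise + w_image * sample.image_noise


def curriculum_stages(samples: Sequence[Sample], n_stages: int) -> Iterator[List[Sample]]:
    """Yield growing training subsets, least noisy samples first (assumed order)."""
    ordered = sorted(samples, key=noise_score)
    for stage in range(1, n_stages + 1):
        cutoff = max(1, len(ordered) * stage // n_stages)
        yield ordered[:cutoff]
```

In a training loop, each stage's subset would be converted with `build_topic_prompt` and fed to the tagger for one or more epochs before moving on to the next, larger subset.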
Source journal
Information Fusion
Category: Engineering and Technology, Computer Science: Theory and Methods
CiteScore: 33.20
Self-citation rate: 4.30%
Articles published: 161
Review time: 7.9 months
About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.