Multimodal Named Entity Recognition based on topic prompt and multi-curriculum denoising

IF 14.7 · Zone 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Information Fusion Pub Date : 2025-06-21 DOI:10.1016/j.inffus.2025.103405

Mingying Xu, Kui Peng, Jie Liu, Qing Zhang, Linqi Song, Yinqiao Li

Information Fusion, Volume 124, Article 103405. Full text: https://www.sciencedirect.com/science/article/pii/S1566253525004786

Citations: 0

Abstract

The rapid development of Generative Large Models (GLMs) such as ChatGPT and GPT4 has significantly enhanced their ability to handle complex tasks and drive innovation across multiple fields, especially social media. However, GLMs are prone to generating “hallucinated” content when dealing with ambiguous problems that lack clear evidence, which undermines their reliability. Multimodal Named Entity Recognition (MNER) addresses this issue by integrating image, text, and contextual information to establish a fact-based framework, thereby reducing the risk of hallucination and strengthening the reasoning foundation of GLMs. Combining GLMs with MNER merges the flexibility of content generation with evidence-based constraints, improving reliability and interpretability. In the MNER task, however, weakly related or irrelevant image information introduces noise that degrades performance. In this paper, we propose a novel framework, TPMCLNet, which combines a topic prompt with a multi-curriculum denoising strategy. First, the topic prompt module extracts topic information from the images and integrates this image-derived information with the text as auxiliary input, thereby enhancing the model’s understanding of multimodal data. This is particularly useful when the correlation between the image and text is weak, as it provides additional semantic cues that help the model identify named entities more accurately. Additionally, we employ a denoising strategy based on multi-curriculum learning, which defines noise metrics at different granularities to progressively optimize the presentation order of the training data, reducing the impact of noise on the model. Within this framework, we conduct a comprehensive noise assessment of both images and text, gradually introducing cleaner data to improve model training. Experimental results show that, by combining the topic prompt with the multi-curriculum denoising strategy, TPMCLNet significantly improves MNER performance in complex multimodal environments, demonstrating its effectiveness.
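
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the TPMCLNet implementation: the abstract gives no formulas, so every name here (`Sample`, `build_topic_prompt`, `difficulty`, `curriculum_stages`), the two noise scores, the equal weighting, and the separator string are assumptions made for illustration. The sketch also assumes the common curriculum-learning convention of presenting the cleanest samples first and admitting noisier ones in later stages; the paper's actual schedule may differ.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Sample:
    text: str                # the social media post
    image_topics: List[str]  # topic words, assumed to come from some image tagger
    image_noise: float       # hypothetical coarse score: image-text irrelevance in [0, 1]
    token_noise: float       # hypothetical fine score: token-level uncertainty in [0, 1]


def build_topic_prompt(sample: Sample, sep: str = " [SEP] ") -> str:
    """Append image-derived topic words to the text as auxiliary input."""
    if not sample.image_topics:
        return sample.text
    return sample.text + sep + "Image topics: " + ", ".join(sample.image_topics)


def difficulty(sample: Sample, w_image: float = 0.5) -> float:
    """Blend the two noise scores (assumed weighting) into one difficulty value."""
    return w_image * sample.image_noise + (1.0 - w_image) * sample.token_noise


def curriculum_stages(samples: List[Sample], num_stages: int = 3) -> Iterator[List[Sample]]:
    """Yield growing training pools, ordered from cleanest to noisiest."""
    ordered = sorted(samples, key=difficulty)
    for stage in range(1, num_stages + 1):
        # Each stage admits a larger prefix of the difficulty-sorted data.
        cutoff = max(1, len(ordered) * stage // num_stages)
        yield ordered[:cutoff]


if __name__ == "__main__":
    data = [
        Sample("Happy Monday", ["cat"], image_noise=0.9, token_noise=0.4),
        Sample("Messi scores in Paris", ["football", "stadium"], 0.1, 0.2),
        Sample("Apple launches new phone", [], 0.5, 0.1),
    ]
    for i, pool in enumerate(curriculum_stages(data), start=1):
        print(f"stage {i}:", [build_topic_prompt(s) for s in pool])
```

In a real pipeline the prompts would be tokenized and fed to the MNER model at each stage, and the noise scores would come from learned or computed metrics rather than hand-set values.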

Source journal

Information Fusion (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 33.20
Self-citation rate: 4.30%
Publication count: 161
Review time: 7.9 months

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.