Improving multimodal named entity recognition via text-image relevance prediction with large language models

Qingyang Zeng, Minghui Yuan, Yueyang Su, Jia Mi, Qianzi Che, Jing Wan

Neurocomputing, Volume 651, Article 130982. DOI: 10.1016/j.neucom.2025.130982. Published 2025-07-10.
Multimodal Named Entity Recognition (MNER) is a critical task in information extraction that aims to identify named entities in text-image pairs and classify them into specific types such as person, organization, and location. While existing studies have achieved moderate success by fusing visual and textual features through cross-modal attention mechanisms, two major challenges remain: (1) image-text mismatch, where the two modalities are not always semantically aligned in real-world scenarios; and (2) insufficient labeled data, which hampers the model’s ability to learn complex cross-modal associations and limits generalization. To overcome these challenges, we propose a novel framework that leverages the semantic comprehension and reasoning capabilities of Large Language Models (LLMs). Specifically, for the mismatch issue, we employ LLMs to generate a text-image relevance score, together with the reasoning behind it, to guide the subsequent modules. We then design a Text-image Relationship Predicting (TRP) module, which determines the final feature fusion weights based on the relevance score provided by the LLMs. To mitigate data scarcity, we prompt LLMs to identify the key entities in the text and incorporate them into the original input. Additionally, we design a Text-image Relevance Features Learning (TRFL) module that constructs positive and negative samples based on the relevance score, employing a supervised contrastive learning method to further enhance the model’s ability to extract key features from image-text pairs. Experiments show that our proposed method achieves F1 scores of 75.32% and 86.65% on the Twitter-2015 and Twitter-2017 datasets, respectively, demonstrating its effectiveness.
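The abstract leaves the fusion mechanics unspecified, so the following is only a minimal sketch of the TRP idea: an LLM-supplied relevance score in [0, 1] scales how much visual information is mixed into the text representation before entity tagging. All names here (fuse_features, the 768-dimensional features) are illustrative assumptions, not the paper’s actual implementation.

```python
import torch

def fuse_features(text_feat: torch.Tensor,
                  image_feat: torch.Tensor,
                  relevance: float) -> torch.Tensor:
    """Gate the visual contribution by the LLM-predicted text-image
    relevance score in [0, 1]; a mismatched pair (score near 0)
    degrades gracefully to text-only features."""
    w = min(max(relevance, 0.0), 1.0)   # clamp score to [0, 1]
    return text_feat + w * image_feat   # weighted residual fusion

# Usage: both modalities projected to a shared hidden size beforehand.
text_feat = torch.randn(1, 16, 768)    # (batch, seq_len, hidden)
image_feat = torch.randn(1, 16, 768)   # visual features aligned to tokens
fused = fuse_features(text_feat, image_feat, relevance=0.8)
```

A scalar gate like this is the simplest reading of “fusion weights based on the relevance score”; the actual module may instead produce per-token or per-head weights.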
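The TRFL module is likewise described only at a high level. The sketch below uses the standard supervised contrastive loss (in the style of Khosla et al., 2020), with positive/negative labels assumed to be derived from the LLM relevance scores; the 0.5 threshold and all names are hypothetical, and the paper’s actual objective may differ.

```python
import torch
import torch.nn.functional as F

def sup_con_loss(embeddings: torch.Tensor,
                 labels: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Supervised contrastive loss over a batch of embeddings whose
    labels mark samples as relevant (1) or irrelevant (0)."""
    z = F.normalize(embeddings, dim=1)            # (N, d) unit vectors
    sim = z @ z.t() / temperature                 # pairwise similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)              # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Average log-probability of each anchor's positives.
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

# Example: image-text pairs with LLM relevance score > 0.5 labeled positive.
emb = torch.randn(8, 128)                 # fused pair embeddings
labels = (torch.rand(8) > 0.5).long()     # hypothetical label assignment
print(sup_con_loss(emb, labels))
```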
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing, covering neurocomputing theory, practice, and applications.