MCIRP: A multi-granularity cross-modal interaction model based on relational propagation for Multimodal Named Entity Recognition with multiple images

IF 6.9 · CAS Tier 1 (Management Science) · JCR Q1, Computer Science, Information Systems
Yongheng Mu, Ziyu Guo, Xuewei Li, Lixu Shao, Shijun Liu, Feng Li, Guangxu Mei
DOI: 10.1016/j.ipm.2025.104384
Journal: Information Processing & Management, Volume 63, Issue 2, Article 104384
Published: 2025-09-15
URL: https://www.sciencedirect.com/science/article/pii/S0306457325003255
Citations: 0

Abstract

Most existing Multimodal Named Entity Recognition (MNER) methods typically focus on processing textual content with a single image and fail to effectively handle content with multiple images. Therefore, MNER with multiple images presents significant research potential. However, current approaches for this task face two key limitations: (1) Treating all images equally without assessing their relevance to the text, which may introduce visual noise from unrelated images; (2) Relying solely on coarse-grained image features while disregarding fine-grained alignments between text and each image. To address the above limitations, this work introduces a novel Multi-granularity Cross-modal Interaction Model based on Relational Propagation (MCIRP), which effectively leverages information from multiple images. For the first limitation, we propose a text–image relation propagation strategy that calculates the correlation score between the text and each image, enabling selective utilization of relevant image information. For the second limitation, we propose a multi-granularity cross-modal interaction fusion technique to facilitate the fusion of text and visual features at different levels of granularity. To the best of our knowledge, this is the first study to explore text–image relation propagation for the MNER task with multiple images. The results show that MCIRP improves the F1 scores on two MNER public datasets with multiple images (MNER-MI and MNER-MI-Plus) by 3.65% and 0.56%, respectively, achieving SOTA performance among existing multi-image methods.
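The paper's first contribution, text–image relation propagation, computes a correlation score between the text and each image so that only relevant images contribute visual features. The sketch below illustrates the general idea with a simple cosine-similarity score and thresholded weighting; the function names, the scoring choice, and the threshold are illustrative assumptions, not MCIRP's actual formulation.

```python
import numpy as np

def relevance_scores(text_emb, image_embs):
    """Cosine similarity between the text embedding and each image embedding.

    A hypothetical stand-in for MCIRP's text-image relation scoring:
    the paper computes a correlation score per image; cosine similarity
    is one simple way to obtain such a score.
    """
    t = text_emb / np.linalg.norm(text_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return imgs @ t

def fuse(text_emb, image_embs, threshold=0.5):
    """Weight each image by its relevance score, dropping images below the
    threshold so unrelated images introduce no visual noise."""
    scores = relevance_scores(text_emb, image_embs)
    mask = scores >= threshold
    if not mask.any():
        return text_emb  # no relevant image: fall back to text features only
    weights = scores * mask
    weights = weights / weights.sum()
    visual = weights @ image_embs  # relevance-weighted visual feature
    return text_emb + visual
```

In this toy version, irrelevant images are masked out entirely rather than merely down-weighted; MCIRP's multi-granularity fusion additionally aligns text with each retained image at both coarse and fine granularity, which this sketch does not model.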
Source Journal

Information Processing & Management (Engineering & Technology — Computer Science: Information Systems)
CiteScore: 17.00
Self-citation rate: 11.60%
Articles per year: 276
Review time: 39 days
Journal description: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Its scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. The journal caters to both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field, with particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research.