{"title":"Multimodal heterogeneous graph entity-level fusion for named entity recognition with multi-granularity visual guidance","authors":"Yunchao Gong, Xueqiang Lv, Zhu Yuan, ZhaoJun Wang, Feng Hu, Xindong You","doi":"10.1007/s11227-024-06347-8","DOIUrl":null,"url":null,"abstract":"<p>Multimodal named entity recognition (MNER) is an emerging foundational task in natural language processing. However, existing methods have two main limitations: 1) previous methods have focused on the visual representation of the entire image or target objects. However, they overlook the fine-grained semantic correspondence between entities and visual target objects, or ignore the visual cues of the overall scene and background details in the image. 2) Existing methods have not effectively overcome the semantic gap between different modalities due to the heterogeneity between text and images. To address these issues, we propose a novel multimodal heterogeneous graph entity-level fusion method for MNER (HGMVG) to achieve cross-modal feature interaction from coarse to fine between text and images under the guidance of visual information at different granularities, which can improve the accuracy of named entity recognition. Specifically, to resolve the first issue, we cascade cross-modal semantic interaction information between text and vision at different visual granularities to obtain a comprehensive and effective multimodal representation. For the second issue, we describe the precise semantic correspondences between entity-level words and visual target objects via multimodal heterogeneous graphs, and utilize heterogeneous graph attention networks to achieve cross-modal fine-grained semantic interactions. 
We conduct extensive experiments on two publicly available Twitter datasets, and the experimental results demonstrate that HGMVG outperforms the current state-of-the-art models in the MNER task.</p>","PeriodicalId":501596,"journal":{"name":"The Journal of Supercomputing","volume":"48 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Journal of Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s11227-024-06347-8","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multimodal named entity recognition (MNER) is an emerging foundational task in natural language processing. However, existing methods have two main limitations: (1) prior methods focus on visual representations of either the entire image or the target objects alone, thereby overlooking either the fine-grained semantic correspondence between entities and visual target objects or the visual cues from the overall scene and background details; (2) owing to the heterogeneity of text and images, existing methods have not effectively bridged the semantic gap between the two modalities. To address these issues, we propose a novel multimodal heterogeneous graph entity-level fusion method for MNER (HGMVG), which performs coarse-to-fine cross-modal feature interaction between text and images under the guidance of visual information at multiple granularities, improving the accuracy of named entity recognition. Specifically, for the first issue, we cascade cross-modal semantic interactions between text and vision at different visual granularities to obtain a comprehensive and effective multimodal representation. For the second issue, we capture precise semantic correspondences between entity-level words and visual target objects via a multimodal heterogeneous graph, and apply a heterogeneous graph attention network to achieve fine-grained cross-modal semantic interaction. Extensive experiments on two publicly available Twitter datasets show that HGMVG outperforms current state-of-the-art models on the MNER task.
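The entity-level fusion idea described above can be pictured as each word node attending over the detected visual object nodes and absorbing a weighted mixture of their features. The following is a minimal illustrative numpy sketch of such cross-modal attention; it is not the authors' HGMVG implementation, and all function names, shapes, and dimensions here are assumptions chosen for the example:

```python
import numpy as np

def cross_modal_attention(text_feats, visual_feats):
    """Each word (query) attends over visual objects (keys/values)
    via scaled dot-product attention, yielding one visually grounded
    feature vector per word."""
    d = text_feats.shape[-1]
    # similarity of every word to every visual object: (n_words, n_objects)
    scores = text_feats @ visual_feats.T / np.sqrt(d)
    # row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # mix object features into each word: (n_words, d)
    return weights @ visual_feats

# toy example: 5 word nodes, 3 visual object nodes, 16-dim features
rng = np.random.default_rng(0)
word_nodes = rng.normal(size=(5, 16))
object_nodes = rng.normal(size=(3, 16))
fused = cross_modal_attention(word_nodes, object_nodes)
print(fused.shape)  # one fused vector per word: (5, 16)
```

In the paper's actual model this interaction runs over a heterogeneous graph with typed nodes and edges (word vs. object) rather than a dense all-pairs attention, but the per-edge attention-and-aggregate pattern is the same.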