{"title":"Multimodal heterogeneous graph entity-level fusion for named entity recognition with multi-granularity visual guidance","authors":"Yunchao Gong, Xueqiang Lv, Zhu Yuan, ZhaoJun Wang, Feng Hu, Xindong You","doi":"10.1007/s11227-024-06347-8","DOIUrl":null,"url":null,"abstract":"<p>Multimodal named entity recognition (MNER) is an emerging foundational task in natural language processing. However, existing methods have two main limitations: 1) previous methods have focused on the visual representation of the entire image or target objects. However, they overlook the fine-grained semantic correspondence between entities and visual target objects, or ignore the visual cues of the overall scene and background details in the image. 2) Existing methods have not effectively overcome the semantic gap between different modalities due to the heterogeneity between text and images. To address these issues, we propose a novel multimodal heterogeneous graph entity-level fusion method for MNER (HGMVG) to achieve cross-modal feature interaction from coarse to fine between text and images under the guidance of visual information at different granularities, which can improve the accuracy of named entity recognition. Specifically, to resolve the first issue, we cascade cross-modal semantic interaction information between text and vision at different visual granularities to obtain a comprehensive and effective multimodal representation. For the second issue, we describe the precise semantic correspondences between entity-level words and visual target objects via multimodal heterogeneous graphs, and utilize heterogeneous graph attention networks to achieve cross-modal fine-grained semantic interactions. 
We conduct extensive experiments on two publicly available Twitter datasets, and the experimental results demonstrate that HGMVG outperforms the current state-of-the-art models in the MNER task.</p>","PeriodicalId":501596,"journal":{"name":"The Journal of Supercomputing","volume":"48 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Journal of Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s11227-024-06347-8","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multimodal named entity recognition (MNER) is an emerging foundational task in natural language processing. However, existing methods have two main limitations: (1) prior methods focus on visual representations of either the entire image or the target objects alone, thereby overlooking either the fine-grained semantic correspondence between entities and visual target objects or the visual cues from the overall scene and background details; (2) owing to the heterogeneity of text and images, existing methods have not effectively bridged the semantic gap between the two modalities. To address these issues, we propose a novel multimodal heterogeneous graph entity-level fusion method for MNER (HGMVG), which performs coarse-to-fine cross-modal feature interaction between text and images under the guidance of visual information at multiple granularities, improving the accuracy of named entity recognition. Specifically, for the first issue, we cascade cross-modal semantic interactions between text and vision at different visual granularities to obtain a comprehensive and effective multimodal representation. For the second issue, we capture precise semantic correspondences between entity-level words and visual target objects via a multimodal heterogeneous graph, and apply a heterogeneous graph attention network to achieve fine-grained cross-modal semantic interaction. Extensive experiments on two publicly available Twitter datasets show that HGMVG outperforms current state-of-the-art models on the MNER task.
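The entity-level fusion idea described above can be pictured as each word node attending over the detected visual object nodes and absorbing a weighted mixture of their features. The following is a minimal illustrative numpy sketch of such cross-modal attention; it is not the authors' HGMVG implementation, and all function names, shapes, and dimensions here are assumptions chosen for the example:

```python
import numpy as np

def cross_modal_attention(text_feats, visual_feats):
    """Each word (query) attends over visual objects (keys/values)
    via scaled dot-product attention, yielding one visually grounded
    feature vector per word."""
    d = text_feats.shape[-1]
    # similarity of every word to every visual object: (n_words, n_objects)
    scores = text_feats @ visual_feats.T / np.sqrt(d)
    # row-wise softmax (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # mix object features into each word: (n_words, d)
    return weights @ visual_feats

# toy example: 5 word nodes, 3 visual object nodes, 16-dim features
rng = np.random.default_rng(0)
word_nodes = rng.normal(size=(5, 16))
object_nodes = rng.normal(size=(3, 16))
fused = cross_modal_attention(word_nodes, object_nodes)
print(fused.shape)  # one fused vector per word: (5, 16)
```

In the paper's actual model this interaction runs over a heterogeneous graph with typed nodes and edges (word vs. object) rather than a dense all-pairs attention, but the per-edge attention-and-aggregate pattern is the same.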