Hua Zhang, Xianlv Liang, Wanxiang Cai, Pengliang Chen, Bi Chen, Bo Jiang, Ye Wang
{"title":"基于llm的多模态命名实体识别桥接演示和多源注意融合","authors":"Hua Zhang , Xianlv Liang , Wanxiang Cai , Pengliang Chen , Bi Chen , Bo Jiang , Ye Wang","doi":"10.1016/j.inffus.2025.103800","DOIUrl":null,"url":null,"abstract":"<div><div>Grounded multimodal named entity recognition (GMNER) is a challenging and emerging task that aims to identify all entity-type-region triplets from multimodal image-text pairs. Existing approaches often struggle with insufficient interaction between named entities and visual regions, leading to difficulties in accurate triplet alignment, cross-modal entity disambiguation, and visual semantic grounding. To tackle these challenges, we present a novel two-stage GMNER framework that integrates demonstration retrieval and multi-source cross-layer attention fusion. The initial stage for MNER employs entity-aware attention mechanism to select task-relevant demonstration examples, enabling large language models (LLMs) to generate high-quality external knowledge. The subsequent stage for visual grounding implements a sufficient cross-modal semantic interaction by introducing the multi-source multi-head cross-layer attention fusion (MMCAF) module, which integrates multi-source inputs (raw text, named and visual entity expressions, and image captions). Meanwhile, within this two-stage framework, we adopt a dual-LLM architecture using both text and vision LLMs, aiming to separate the generation of semantic priors from visual-language alignment and bridge gaps in cross-modal understanding. Our model achieves state-of-the-art performance across two GMNER datasets (Twitter-GMNER and Twitter-FMNERG) with different granularity, and further demonstrates superiority in ablation experiments and cross-domain evaluation.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103800"},"PeriodicalIF":15.5000,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Bridging demonstration and multi-source attention fusion using LLMs for grounded multimodal named entity recognition\",\"authors\":\"Hua Zhang , Xianlv Liang , Wanxiang Cai , Pengliang Chen , Bi Chen , Bo Jiang , Ye Wang\",\"doi\":\"10.1016/j.inffus.2025.103800\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Grounded multimodal named entity recognition (GMNER) is a challenging and emerging task that aims to identify all entity-type-region triplets from multimodal image-text pairs. Existing approaches often struggle with insufficient interaction between named entities and visual regions, leading to difficulties in accurate triplet alignment, cross-modal entity disambiguation, and visual semantic grounding. To tackle these challenges, we present a novel two-stage GMNER framework that integrates demonstration retrieval and multi-source cross-layer attention fusion. The initial stage for MNER employs entity-aware attention mechanism to select task-relevant demonstration examples, enabling large language models (LLMs) to generate high-quality external knowledge. The subsequent stage for visual grounding implements a sufficient cross-modal semantic interaction by introducing the multi-source multi-head cross-layer attention fusion (MMCAF) module, which integrates multi-source inputs (raw text, named and visual entity expressions, and image captions). 
Meanwhile, within this two-stage framework, we adopt a dual-LLM architecture using both text and vision LLMs, aiming to separate the generation of semantic priors from visual-language alignment and bridge gaps in cross-modal understanding. Our model achieves state-of-the-art performance across two GMNER datasets (Twitter-GMNER and Twitter-FMNERG) with different granularity, and further demonstrates superiority in ablation experiments and cross-domain evaluation.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103800\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008620\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008620","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Bridging demonstration and multi-source attention fusion using LLMs for grounded multimodal named entity recognition
Grounded multimodal named entity recognition (GMNER) is a challenging and emerging task that aims to identify all entity-type-region triplets from multimodal image-text pairs. Existing approaches often struggle with insufficient interaction between named entities and visual regions, leading to difficulties in accurate triplet alignment, cross-modal entity disambiguation, and visual semantic grounding. To tackle these challenges, we present a novel two-stage GMNER framework that integrates demonstration retrieval and multi-source cross-layer attention fusion. The initial stage for MNER employs an entity-aware attention mechanism to select task-relevant demonstration examples, enabling large language models (LLMs) to generate high-quality external knowledge. The subsequent stage for visual grounding enables sufficient cross-modal semantic interaction by introducing the multi-source multi-head cross-layer attention fusion (MMCAF) module, which integrates multi-source inputs (raw text, named and visual entity expressions, and image captions). Meanwhile, within this two-stage framework, we adopt a dual-LLM architecture using both text and vision LLMs, aiming to separate the generation of semantic priors from visual-language alignment and to bridge gaps in cross-modal understanding. Our model achieves state-of-the-art performance on two GMNER datasets of different granularity (Twitter-GMNER and Twitter-FMNERG), and further demonstrates its superiority in ablation experiments and cross-domain evaluation.
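To make the fusion idea in the abstract more concrete, the following is a minimal, hedged sketch of multi-source multi-head cross-attention fusion, loosely in the spirit of the MMCAF module described above. All class names, dimensions, the number of layers, and the order in which sources are attended to are assumptions for illustration only; they are not taken from the paper.

```python
# Illustrative sketch only: multi-source multi-head cross-attention fusion
# in the spirit of the MMCAF module described in the abstract. Module names,
# dimensions, and fusion order are assumptions, not the authors' design.
import torch
import torch.nn as nn


class MultiSourceCrossAttentionFusion(nn.Module):
    """Fuses visual region features with several textual sources
    (raw text, entity expressions, image captions) via stacked
    multi-head cross-attention layers."""

    def __init__(self, d_model: int = 768, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleDict({
                "text_attn": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                "entity_attn": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                "caption_attn": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                "norm": nn.LayerNorm(d_model),
                "ffn": nn.Sequential(
                    nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
                ),
            })
            for _ in range(num_layers)
        )

    def forward(self, region_feats, text_feats, entity_feats, caption_feats):
        # region_feats: (B, R, d) candidate visual region features
        # text_feats / entity_feats / caption_feats: (B, L_*, d) textual sources
        x = region_feats
        for layer in self.layers:
            # Visual regions attend to each textual source in turn;
            # residual connections preserve the original region signal.
            x = x + layer["text_attn"](x, text_feats, text_feats)[0]
            x = x + layer["entity_attn"](x, entity_feats, entity_feats)[0]
            x = x + layer["caption_attn"](x, caption_feats, caption_feats)[0]
            x = layer["norm"](x + layer["ffn"](x))
        return x  # fused region representations, e.g. for grounding scores


# Toy usage with random features (batch of 2, 16 regions, 768-dim embeddings)
if __name__ == "__main__":
    fusion = MultiSourceCrossAttentionFusion()
    regions = torch.randn(2, 16, 768)
    text = torch.randn(2, 32, 768)
    entities = torch.randn(2, 8, 768)
    captions = torch.randn(2, 20, 768)
    print(fusion(regions, text, entities, captions).shape)  # torch.Size([2, 16, 768])
```

The sketch only shows how multiple textual sources could be injected into visual region representations through repeated cross-attention; the actual MMCAF module and its interaction with the dual-LLM pipeline are detailed in the paper itself.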
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.