The more quality information the better: Hierarchical generation of multi-evidence alignment and fusion model for multimodal entity and relation extraction
Impact Factor 7.4 | Zone 1 (Management) | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS
Authors: Xinyu He, Shixin Li, Yuning Zhang, Binhe Li, Sifan Xu, Yuqing Zhou
Journal: Information Processing & Management
DOI: 10.1016/j.ipm.2024.103875
Publication date: 2024-09-07
Publication type: Journal Article
URL: https://www.sciencedirect.com/science/article/pii/S0306457324002346
Citations: 0
Abstract
Multimodal Entity and Relation Extraction (MERE) encompasses tasks such as Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE), which aim to extract valuable information from environments rich in multimodal data. Current research faces several challenges: insufficient use of emotional information in multimodal data, mismatches between textual and visual content, ambiguous meanings, and difficulty achieving precise alignment across different semantic levels. To address these issues, we propose the Hierarchical Generation of Multi-Evidence Alignment Fusion Model for Multimodal Entity and Relation Extraction (HGMAF). The model comprises a hierarchical diffusion semantic generation stage and a multi-evidence alignment fusion module. In the first stage, we design different prompt templates for the original text and use a Large Language Model (LLM) to generate the corresponding hierarchical textual content. The generated hierarchical content is then diffused to obtain images with rich hierarchical semantic information. This stage enhances the model's understanding of the hierarchical information in the original content. The multi-evidence alignment fusion module then combines the generated textual and image evidence, drawing on information from different sources to improve extraction accuracy. Experimental results show that our model achieves F1 scores of 76.29%, 87.66%, and 87.34% on the Twitter2015, Twitter2017, and MNRE datasets, respectively, surpassing the previous state-of-the-art models by 0.29%, 0.1%, and 2.77%. Our model also demonstrates superior performance in low-resource scenarios, confirming its effectiveness. The related code can be found at https://github.com/lsx314/HGMAF.
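The two-stage flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). The prompt templates, the `Evidence` class, the placeholder `fake_llm` and `fake_text_to_image` functions, and the similarity-weighted averaging used as a stand-in for the alignment fusion module are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical prompt templates, one per semantic level (not taken from the paper).
PROMPT_TEMPLATES: Dict[str, str] = {
    "entity": "List the entities mentioned in: {text}",
    "relation": "Describe how the entities relate in: {text}",
    "scene": "Describe the overall scene implied by: {text}",
}


@dataclass
class Evidence:
    level: str     # which template produced this evidence
    text: str      # LLM-generated textual evidence
    image: object  # image generated from that text


def fake_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; simply echoes the prompt."""
    return f"[generated] {prompt}"


def fake_text_to_image(text: str) -> str:
    """Placeholder for a real diffusion model; returns a tag, not pixels."""
    return f"<image for: {text[:30]}>"


def generate_evidence(
    text: str,
    llm: Callable[[str], str],
    text_to_image: Callable[[str], object],
) -> List[Evidence]:
    """Stage 1: prompt the LLM per level, then generate an image per output."""
    evidence = []
    for level, template in PROMPT_TEMPLATES.items():
        generated = llm(template.format(text=text))
        evidence.append(Evidence(level, generated, text_to_image(generated)))
    return evidence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def evidence_weights(
    original: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[float]:
    """Score each evidence embedding by dot product with the original text."""
    scores = [sum(a * b for a, b in zip(original, c)) for c in candidates]
    return softmax(scores)


def fuse(
    original: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[float]:
    """Stage 2 stand-in: similarity-weighted average of evidence embeddings."""
    weights = evidence_weights(original, candidates)
    return [
        sum(w * c[i] for w, c in zip(weights, candidates))
        for i in range(len(original))
    ]
```

In a real system, `fake_llm` and `fake_text_to_image` would be replaced by calls to an actual LLM and diffusion model, and the embeddings passed to `fuse` would come from text and image encoders. The weighting step only conveys the general idea of letting better-aligned evidence contribute more.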
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.