Joint multimodal entity-relation extraction based on temporal enhancement and similarity-gated attention

IF 7.2 · CAS Region 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
{"title":"基于时间增强和相似性驱动注意力的联合多模态实体关联提取","authors":"","doi":"10.1016/j.knosys.2024.112504","DOIUrl":null,"url":null,"abstract":"<div><p>Joint Multimodal Entity and Relation Extraction (JMERE), which needs to combine complex image information to extract entity-relation quintuples from text sequences, posts higher requirements of the model’s multimodal feature fusion and selection capabilities. With the advancement of large pre-trained language models, existing studies focus on improving the feature alignments between textual and visual modalities. However, there remains a noticeable gap in capturing the temporal information present in textual sequences. In addition, these methods exhibit a certain deficiency in distinguishing irrelevant images when integrating image and text features, making them susceptible to interference from image information unrelated to the text. To address these challenges, we propose a temporally enhanced and similarity-gated attention network (TESGA) for joint multimodal entity relation extraction. Specifically, we first incorporate an LSTM-based Text Temporal Enhancement module to enhance the model’s ability to capture temporal information from the text. Next, we introduce a Text-Image Similarity-Gated Attention mechanism, which controls the degree of incorporating image information based on the consistency between image and text features. Subsequently, We design the entity and relation prediction module using a form-filling approach based on entity and relation types, and conduct prediction of entity-relation quintuples. Notably, apart from the JMERE task, our approach can also be applied to other tasks involving text-visual enhancement, such as Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE). To demonstrate the effectiveness of our approach, our model is extensively experimented on three benchmark datasets and achieves state-of-the-art performance. Our code will be available upon paper acceptance.<span><span><sup>1</sup></span></span></p></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":null,"pages":null},"PeriodicalIF":7.2000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Joint multimodal entity-relation extraction based on temporal enhancement and similarity-gated attention\",\"authors\":\"\",\"doi\":\"10.1016/j.knosys.2024.112504\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Joint Multimodal Entity and Relation Extraction (JMERE), which needs to combine complex image information to extract entity-relation quintuples from text sequences, posts higher requirements of the model’s multimodal feature fusion and selection capabilities. With the advancement of large pre-trained language models, existing studies focus on improving the feature alignments between textual and visual modalities. However, there remains a noticeable gap in capturing the temporal information present in textual sequences. In addition, these methods exhibit a certain deficiency in distinguishing irrelevant images when integrating image and text features, making them susceptible to interference from image information unrelated to the text. To address these challenges, we propose a temporally enhanced and similarity-gated attention network (TESGA) for joint multimodal entity relation extraction. 
Specifically, we first incorporate an LSTM-based Text Temporal Enhancement module to enhance the model’s ability to capture temporal information from the text. Next, we introduce a Text-Image Similarity-Gated Attention mechanism, which controls the degree of incorporating image information based on the consistency between image and text features. Subsequently, We design the entity and relation prediction module using a form-filling approach based on entity and relation types, and conduct prediction of entity-relation quintuples. Notably, apart from the JMERE task, our approach can also be applied to other tasks involving text-visual enhancement, such as Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE). To demonstrate the effectiveness of our approach, our model is extensively experimented on three benchmark datasets and achieves state-of-the-art performance. Our code will be available upon paper acceptance.<span><span><sup>1</sup></span></span></p></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":7.2000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705124011389\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705124011389","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


Joint Multimodal Entity and Relation Extraction (JMERE), which needs to combine complex image information to extract entity-relation quintuples from text sequences, places higher demands on a model's multimodal feature fusion and selection capabilities. With the advancement of large pre-trained language models, existing studies focus on improving feature alignment between the textual and visual modalities. However, there remains a noticeable gap in capturing the temporal information present in textual sequences. In addition, these methods struggle to filter out irrelevant images when integrating image and text features, making them susceptible to interference from image information unrelated to the text. To address these challenges, we propose a temporally enhanced and similarity-gated attention network (TESGA) for joint multimodal entity-relation extraction. Specifically, we first incorporate an LSTM-based Text Temporal Enhancement module to strengthen the model's ability to capture temporal information from the text. Next, we introduce a Text-Image Similarity-Gated Attention mechanism, which controls the degree to which image information is incorporated based on the consistency between image and text features. Subsequently, we design an entity and relation prediction module that uses a form-filling approach based on entity and relation types to predict entity-relation quintuples. Notably, apart from the JMERE task, our approach can also be applied to other tasks involving text-visual enhancement, such as Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE). To demonstrate the effectiveness of our approach, we conduct extensive experiments on three benchmark datasets, where our model achieves state-of-the-art performance. Our code will be available upon paper acceptance.
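To make the two fusion components named in the abstract concrete, the following is a minimal PyTorch sketch of an LSTM-based text temporal enhancement module and a text-image similarity-gated attention block. All class names, dimensions, the residual connections, and the cosine-similarity gating formula are assumptions made for illustration; they are not taken from the paper's implementation and may differ from the actual TESGA architecture.

```python
# Illustrative sketch only: module names, dimensions, and the gating formula
# are assumptions, not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextTemporalEnhancement(nn.Module):
    """Bi-LSTM over token embeddings to inject sequential (temporal) context."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, seq_len, hidden_dim), e.g. encoder token embeddings
        enhanced, _ = self.lstm(text_feats)
        return enhanced + text_feats  # residual connection (assumed)


class SimilarityGatedAttention(nn.Module):
    """Cross-modal attention whose output is scaled by a text-image similarity
    gate, so images inconsistent with the text contribute little."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          batch_first=True)

    def forward(self, text_feats: torch.Tensor,
                image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, seq_len, d); image_feats: (batch, n_regions, d)
        attended, _ = self.attn(query=text_feats, key=image_feats,
                                value=image_feats)
        # Gate: cosine similarity between pooled text and image representations.
        text_global = text_feats.mean(dim=1)    # (batch, d)
        image_global = image_feats.mean(dim=1)  # (batch, d)
        gate = F.cosine_similarity(text_global, image_global, dim=-1)
        gate = gate.clamp(min=0.0).unsqueeze(-1).unsqueeze(-1)  # (batch, 1, 1)
        # Low text-image consistency -> gate near 0 -> visual features suppressed.
        return text_feats + gate * attended


# Example usage with random tensors standing in for encoder outputs.
text = torch.randn(2, 32, 768)    # e.g. BERT token features
image = torch.randn(2, 49, 768)   # e.g. projected image-region features
fused = SimilarityGatedAttention(768)(TextTemporalEnhancement(768)(text), image)
```

The gate here is a single scalar per example; a learned, token-level gate would be a natural alternative, but the scalar version already shows the key idea of suppressing visual features when the image does not match the text.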

Source journal
Knowledge-Based Systems (Engineering & Technology — Computer Science: Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Articles published: 1245
Review time: 7.8 months
Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.