{"title":"ME3A:用于多模态实体对齐的多模态实体关联框架","authors":"Yu Zhao, Ying Zhang, Xuhui Sui, Xiangrui Cai","doi":"10.1016/j.ipm.2024.103951","DOIUrl":null,"url":null,"abstract":"<div><div>Current methods for multimodal entity alignment (MEA) primarily rely on entity representation learning, which undermines entity alignment performance because of cross-KG interaction deficiency and multimodal heterogeneity. In this paper, we propose a <strong>M</strong>ultimodal <strong>E</strong>ntity <strong>E</strong>ntailment framework of multimodal <strong>E</strong>ntity <strong>A</strong>lignment task, <strong>ME<sup>3</sup>A</strong>, and recast the MEA task as an entailment problem about entities in the two KGs. This way, the cross-KG modality information directly interacts with each other in the unified textual space. Specifically, we construct the multimodal information in the unified textual space as textual sequences: for relational and attribute modalities, we combine the neighbors and attribute values of entities as sentences; for visual modality, we map the entity image as trainable prefixes and insert them into sequences. Then, we input the concatenated sequences of two entities into the pre-trained language model (PLM) as an entailment reasoner to capture the unified fine-grained correlation pattern of the multimodal tokens between entities. Two types of entity aligners are proposed to model the bi-directional entailment probability as the entity similarity. 
Extensive experiments conducted on nine MEA datasets with various modality combination settings demonstrate that our ME<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>A effectively incorporates multimodal information and surpasses the performance of the state-of-the-art MEA methods by 16.5% at most.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":null,"pages":null},"PeriodicalIF":7.4000,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ME3A: A Multimodal Entity Entailment framework for multimodal Entity Alignment\",\"authors\":\"Yu Zhao, Ying Zhang, Xuhui Sui, Xiangrui Cai\",\"doi\":\"10.1016/j.ipm.2024.103951\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Current methods for multimodal entity alignment (MEA) primarily rely on entity representation learning, which undermines entity alignment performance because of cross-KG interaction deficiency and multimodal heterogeneity. In this paper, we propose a <strong>M</strong>ultimodal <strong>E</strong>ntity <strong>E</strong>ntailment framework of multimodal <strong>E</strong>ntity <strong>A</strong>lignment task, <strong>ME<sup>3</sup>A</strong>, and recast the MEA task as an entailment problem about entities in the two KGs. This way, the cross-KG modality information directly interacts with each other in the unified textual space. Specifically, we construct the multimodal information in the unified textual space as textual sequences: for relational and attribute modalities, we combine the neighbors and attribute values of entities as sentences; for visual modality, we map the entity image as trainable prefixes and insert them into sequences. 
Then, we input the concatenated sequences of two entities into the pre-trained language model (PLM) as an entailment reasoner to capture the unified fine-grained correlation pattern of the multimodal tokens between entities. Two types of entity aligners are proposed to model the bi-directional entailment probability as the entity similarity. Extensive experiments conducted on nine MEA datasets with various modality combination settings demonstrate that our ME<span><math><msup><mrow></mrow><mrow><mn>3</mn></mrow></msup></math></span>A effectively incorporates multimodal information and surpasses the performance of the state-of-the-art MEA methods by 16.5% at most.</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2024-11-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457324003108\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457324003108","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Current methods for multimodal entity alignment (MEA) primarily rely on entity representation learning, which undermines entity alignment performance because of deficient cross-KG interaction and multimodal heterogeneity. In this paper, we propose a Multimodal Entity Entailment framework for the multimodal Entity Alignment task, ME3A, and recast the MEA task as an entailment problem between entities in the two KGs. This way, modality information from the two KGs interacts directly in the unified textual space. Specifically, we construct the multimodal information in the unified textual space as textual sequences: for the relational and attribute modalities, we combine the neighbors and attribute values of entities into sentences; for the visual modality, we map the entity image to trainable prefixes and insert them into the sequences. Then, we feed the concatenated sequences of two entities into a pre-trained language model (PLM) serving as an entailment reasoner, which captures fine-grained correlation patterns between the entities' multimodal tokens. Two types of entity aligners are proposed to model the bi-directional entailment probability as the entity similarity. Extensive experiments conducted on nine MEA datasets with various modality combination settings demonstrate that our ME3A effectively incorporates multimodal information and surpasses the performance of state-of-the-art MEA methods by up to 16.5%.
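The pipeline the abstract describes — verbalize each entity's neighbors and attributes into a textual sequence, then score alignment as bi-directional entailment — can be sketched as follows. This is a minimal illustrative sketch: the entity data and the token-overlap scoring function are assumptions standing in for the PLM entailment reasoner the paper actually uses, and the visual-prefix step is omitted.

```python
# Hypothetical sketch of ME3A's core idea. All entity names and data
# are illustrative; entailment_prob is a toy stand-in for the PLM
# entailment reasoner described in the paper.

def verbalize(name, neighbors, attributes):
    """Build the textual sequence for one entity: neighbor relations
    and attribute key-value pairs joined into sentences."""
    rel = "; ".join(f"{name} is linked to {n}" for n in neighbors)
    attr = "; ".join(f"{name} has {k} {v}" for k, v in attributes.items())
    return f"{rel}. {attr}."

def entailment_prob(premise, hypothesis):
    """Toy entailment score via token overlap. A real implementation
    would feed the concatenated sequences to a PLM classifier."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def alignment_similarity(seq_a, seq_b):
    """Entity similarity as the average of both entailment directions,
    mirroring the bi-directional entailment probability."""
    return 0.5 * (entailment_prob(seq_a, seq_b) + entailment_prob(seq_b, seq_a))

# Two KGs describing the same city with different surface names,
# plus a non-matching entity for contrast.
seq1 = verbalize("Paris", ["France", "Seine"], {"population": "2.1M"})
seq2 = verbalize("Paris_(city)", ["France", "Seine"], {"population": "2.1M"})
seq3 = verbalize("Berlin", ["Germany", "Spree"], {"population": "3.6M"})
assert alignment_similarity(seq1, seq2) > alignment_similarity(seq1, seq3)
```

In the paper, the concatenated sequence pair is scored by the PLM in both directions, so this symmetric average is only one plausible way to combine the two entailment probabilities into a similarity.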
Journal introduction:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.