{"title":"用于多模态知识图谱补全的多跳邻居融合增强型分层变换器","authors":"Yunpeng Wang, Bo Ning, Xin Wang, Guanyu Li","doi":"10.1007/s11280-024-01289-w","DOIUrl":null,"url":null,"abstract":"<p>Multi-modal knowledge graph (MKG) refers to a structured semantic network that accurately represents the real-world information by incorporating multiple modalities. Existing researches primarily focus on leveraging multi-modal fusion to enhance the representation capability of entity nodes and link prediction to deal with the incompleteness of the MKG. However, the inherent heterogeneity between structural modality and semantic modality poses challenges to the multi-modal fusion, as noise interference could compromise the effectiveness of the fusion representation. In this study, we propose a novel hierarchical Transformer architecture, named MNFormer, which captures the structural and semantic information while avoiding heterogeneity issues by fully integrating both multi-hop neighbor paths and image-text embeddings. During the encoding stage of MNFormer, we design multiple layers of Multi-hop Neighbor Fusion (MNF) module that employ attentions to merge the image and text features. These MNF modules progressively fuse the information of neighboring entities hop by hop along the neighbor paths of the source entity. The Transformer during decoding stage is then utilized to integrate the outputs of all MNF modules, whose output is subsequently employed to match target entities and accomplish MKG completion. Moreover, we develop a semantic direction loss to enhance the fitting performance of MNFormer. Experimental results on four datasets demonstrate that MNFormer exhibits notable competitiveness when compared to the state-of-the-art models. Additionally, ablation studies showcase the significant ability of MNFormer to effectively combine structural and semantic information, leading to enhanced performance through complementary enhancements.</p>","PeriodicalId":501180,"journal":{"name":"World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-hop neighbor fusion enhanced hierarchical transformer for multi-modal knowledge graph completion\",\"authors\":\"Yunpeng Wang, Bo Ning, Xin Wang, Guanyu Li\",\"doi\":\"10.1007/s11280-024-01289-w\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Multi-modal knowledge graph (MKG) refers to a structured semantic network that accurately represents the real-world information by incorporating multiple modalities. Existing researches primarily focus on leveraging multi-modal fusion to enhance the representation capability of entity nodes and link prediction to deal with the incompleteness of the MKG. However, the inherent heterogeneity between structural modality and semantic modality poses challenges to the multi-modal fusion, as noise interference could compromise the effectiveness of the fusion representation. In this study, we propose a novel hierarchical Transformer architecture, named MNFormer, which captures the structural and semantic information while avoiding heterogeneity issues by fully integrating both multi-hop neighbor paths and image-text embeddings. During the encoding stage of MNFormer, we design multiple layers of Multi-hop Neighbor Fusion (MNF) module that employ attentions to merge the image and text features. 
These MNF modules progressively fuse the information of neighboring entities, hop by hop, along the neighbor paths of the source entity. A Transformer in the decoding stage then integrates the outputs of all MNF modules, and its output is used to match target entities and complete the MKG. We further develop a semantic direction loss to improve the fitting performance of MNFormer. Experimental results on four datasets show that MNFormer is highly competitive with state-of-the-art models. In addition, ablation studies demonstrate MNFormer's ability to effectively combine structural and semantic information, with the two sources complementing each other to improve performance.
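As a rough illustration of the encoding stage the abstract describes, below is a minimal sketch of one attention-based image-text fusion step. The class name, tensor shapes, and the single-query cross-attention layout are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of one Multi-hop Neighbor Fusion (MNF) step: attention merges
# an entity's image and text features into the running path representation.
# All shapes and module choices here are assumptions, not the paper's design.
import torch
import torch.nn as nn

class MNFLayer(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Cross-modal attention: the fused path state queries the current
        # hop's image and text embeddings (keys/values).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, state, img_emb, txt_emb):
        # state:   (B, 1, D)  fused representation after previous hops
        # img_emb: (B, 1, D)  image embedding of the current-hop neighbor
        # txt_emb: (B, 1, D)  text embedding of the current-hop neighbor
        kv = torch.cat([img_emb, txt_emb], dim=1)   # (B, 2, D)
        fused, _ = self.attn(state, kv, kv)         # (B, 1, D)
        return self.norm(state + fused)             # residual update
```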
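Continuing the sketch, the hop-by-hop encoding loop and the decoding-stage Transformer might be wired together as below (reusing `MNFLayer` from the previous block). The mean pooling, the dot-product entity scoring, and all hyperparameters are assumptions for illustration; the semantic direction loss is omitted because the abstract does not define it.

```python
# Hedged end-to-end sketch: MNF layers fuse neighbors hop by hop, a Transformer
# integrates the per-hop outputs, and the pooled result scores target entities.
class MNFormerSketch(nn.Module):
    def __init__(self, dim: int = 256, hops: int = 3, n_entities: int = 10000):
        super().__init__()
        self.layers = nn.ModuleList(MNFLayer(dim) for _ in range(hops))
        dec_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.entity_emb = nn.Embedding(n_entities, dim)

    def forward(self, src_state, path_img, path_txt):
        # src_state: (B, 1, D); path_img / path_txt: per-hop lists of (B, 1, D)
        outs, state = [], src_state
        for layer, img, txt in zip(self.layers, path_img, path_txt):
            state = layer(state, img, txt)   # fold in one more hop
            outs.append(state)
        seq = torch.cat(outs, dim=1)             # (B, hops, D)
        dec = self.decoder(seq).mean(dim=1)      # (B, D) pooled decoding
        # Score every candidate target entity by dot product.
        return dec @ self.entity_emb.weight.T    # (B, n_entities)
```

Scoring all candidate entities by dot product mirrors standard knowledge-graph link prediction, where the highest-scoring candidate is taken as the missing entity; whether MNFormer matches targets this way is an assumption here.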