When GDD meets GNN: A knowledge-driven neural connection for effective entity resolution in property graphs

IF 3.0 · Zone 2 (Computer Science) · JCR Q2: COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems · Pub Date: 2025-03-22 · DOI: 10.1016/j.is.2025.102551

Junwei Hu, Michael Bewong, Selasi Kwashie, Yidi Zhang, Vincent Nofong, John Wondoh, Zaiwen Feng

Journal Article · Information Systems, Volume 132, Article 102551 · Full text: https://www.sciencedirect.com/science/article/pii/S0306437925000365

Citations: 0

Abstract



When GDD meets GNN: A knowledge-driven neural connection for effective entity resolution in property graphs

This paper studies the entity resolution (ER) problem in property graphs. ER is the task of identifying and linking different records that refer to the same real-world entity. It is commonly used in data integration, data cleansing, and other applications where it is important to have accurate and consistent data. In general, two predominant approaches exist in the literature: rule-based and learning-based methods. On the one hand, rule-based techniques are often desired due to their explainability and ability to encode domain knowledge. Learning-based methods, on the other hand, are preferred due to their effectiveness in spite of their black-box nature. In this work, we devise a hybrid ER solution, GraphER, that leverages the strengths of both systems for property graphs. In particular, we adopt graph differential dependency (GDD) for encoding the so-called record-matching rules, and employ them to guide a graph neural network (GNN) based representation learning for the task. We conduct extensive empirical evaluation of our proposal on benchmark ER datasets including 17 graph datasets and 7 relational datasets in comparison with 10 state-of-the-art (SOTA) techniques. The results show that our approach provides a significantly better solution to addressing ER in graph data, both quantitatively and qualitatively, while attaining highly competitive results on the benchmark relational datasets w.r.t. the SOTA solutions.
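To make the idea of a record-matching rule concrete, the following is a minimal sketch, not the GraphER method described in the abstract. It assumes a simplified rule: two nodes with the same label are treated as the same entity when every listed attribute is within a distance threshold. The rule format, the `distance` function, the thresholds, and the sample nodes are all hypothetical, and the sketch leaves out both graph patterns and the GNN component.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a: str, b: str) -> float:
    """Normalized string distance in [0, 1]; 0.0 means identical (case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical rule: both nodes carry the given label, and each listed
# attribute differs by at most its threshold.
RULE = {"label": "Person", "thresholds": {"name": 0.25, "city": 0.0}}


def matches(u: dict, v: dict, rule: dict = RULE) -> bool:
    """Return True when the pair (u, v) satisfies the threshold rule."""
    if u["label"] != rule["label"] or v["label"] != rule["label"]:
        return False
    return all(
        distance(u["props"][attr], v["props"][attr]) <= threshold
        for attr, threshold in rule["thresholds"].items()
    )


def resolve(nodes: list, rule: dict = RULE) -> list:
    """Compare every pair of nodes and return the ID pairs that match the rule."""
    return [(u["id"], v["id"]) for u, v in combinations(nodes, 2) if matches(u, v, rule)]


# Illustrative sample data only; not taken from the paper's benchmarks.
NODES = [
    {"id": 1, "label": "Person", "props": {"name": "Jon Smith", "city": "Perth"}},
    {"id": 2, "label": "Person", "props": {"name": "John Smith", "city": "Perth"}},
    {"id": 3, "label": "Person", "props": {"name": "Mary Jones", "city": "Perth"}},
]

if __name__ == "__main__":
    print(resolve(NODES))
```

With the sample nodes, `resolve(NODES)` links only nodes 1 and 2. A rule of this kind is easy to read and audit, which reflects the explainability the abstract attributes to rule-based methods; the pairwise comparison is quadratic and is used here only for brevity.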


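Source journal

Information Systems (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 9.40
Self-citation rate: 2.70%
Articles published: 112
Review time: 53 days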

About the journal: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.