{"title":"Linking Entities across Relations and Graphs","authors":"Wenfei Fan, Ping Lu, Kehan Pang, Ruochun Jin","doi":"10.1145/3639363","DOIUrl":null,"url":null,"abstract":"<p>This paper proposes a notion of parametric simulation to link entities across a relational database \\(\\mathcal {D} \\) and a graph <i>G</i>. Taking functions and thresholds for measuring vertex closeness, path associations and important properties as parameters, parametric simulation identifies tuples <i>t</i> in \\(\\mathcal {D} \\) and vertices <i>v</i> in <i>G</i> that refer to the same real-world entity, based on both topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. We show that parametric simulation is in quadratic-time, by providing such an algorithm. Moreover, we develop an incremental algorithm for parametric simulation; we show that the incremental algorithm is bounded relative to its batch counterpart, <i>i.e.,</i> it incurs the minimum cost for incrementalizing the batch algorithm. Putting these together, we develop HER, a parallel system to check whether (<i>t</i>, <i>v</i>) makes a match, find all vertex matches of <i>t</i> in <i>G</i>, and compute all matches across \\(\\mathcal {D} \\) and <i>G</i>, all in quadratic-time; moreover, HER supports incremental computation of these in response to updates to \\(\\mathcal {D} \\) and <i>G</i>. 
Using real-life and synthetic data, we empirically verify that HER is accurate with F-measure of 0.94 on average, and is able to scale with database \\(\\mathcal {D} \\) and graph <i>G</i> for both batch and incremental computations.</p>","PeriodicalId":50915,"journal":{"name":"ACM Transactions on Database Systems","volume":"6 1","pages":""},"PeriodicalIF":2.2000,"publicationDate":"2024-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Database Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3639363","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Abstract
This paper proposes a notion of parametric simulation to link entities across a relational database \(\mathcal{D}\) and a graph G. Taking as parameters functions and thresholds for measuring vertex closeness, path associations and important properties, parametric simulation identifies tuples t in \(\mathcal{D}\) and vertices v in G that refer to the same real-world entity, based on both topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. We show that parametric simulation can be computed in quadratic time by providing such an algorithm. Moreover, we develop an incremental algorithm for parametric simulation; we show that the incremental algorithm is bounded relative to its batch counterpart, i.e., it incurs the minimum cost for incrementalizing the batch algorithm. Putting these together, we develop HER, a parallel system that checks whether (t, v) is a match, finds all vertex matches of t in G, and computes all matches across \(\mathcal{D}\) and G, all in quadratic time; moreover, HER supports incremental computation of these in response to updates to \(\mathcal{D}\) and G. Using real-life and synthetic data, we empirically verify that HER is accurate, with an average F-measure of 0.94, and that it scales with database \(\mathcal{D}\) and graph G for both batch and incremental computations.
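For readers unfamiliar with the accuracy metric quoted above, the following is a minimal sketch of how an F-measure can be computed over predicted (tuple, vertex) matches. It assumes the F-measure is the usual balanced F1 score (the harmonic mean of precision and recall), which the abstract does not spell out. The function name `f_measure` and the toy identifiers are hypothetical and do not come from the paper or from HER.

```python
def f_measure(predicted, truth):
    """Return (precision, recall, F1) for two collections of (tuple_id, vertex_id) pairs.

    Assumption: "F-measure" means the balanced F1 score.
    """
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)  # matches that are both predicted and correct
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data (not from the paper): ground-truth and predicted matches.
truth = {("t1", "v1"), ("t2", "v2"), ("t3", "v3"), ("t4", "v4")}
predicted = {("t1", "v1"), ("t2", "v2"), ("t3", "v9")}

p, r, f1 = f_measure(predicted, truth)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this toy case two of the three predicted pairs are correct and two of the four true pairs are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.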
About the journal:
Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.