GSM: A generalized approach to Supervised Meta-blocking for scalable entity resolution

IF 3.0 | Zone 2, Computer Science | JCR Q2: COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems | Pub Date: 2023-10-17 | DOI: 10.1016/j.is.2023.102307

Luca Gagliardelli, George Papadakis, Giovanni Simonini, Sonia Bergamaschi, Themis Palpanas


Citations: 0

Abstract


GSM: A generalized approach to Supervised Meta-blocking for scalable entity resolution

Entity Resolution (ER) constitutes a core data integration task that relies on Blocking in order to tame its quadratic time complexity. Schema-agnostic blocking achieves very high recall, requires no domain knowledge and applies to data of any structuredness and schema heterogeneity. This comes at the cost of many irrelevant candidate pairs (i.e., comparisons), which can be significantly reduced through Meta-blocking techniques, i.e., techniques that leverage the co-occurrence patterns of entities inside the blocks: first, a weighting scheme assigns a score to every pair of candidate entities in proportion to the likelihood that they are matching and then, a pruning algorithm discards the pairs with the lowest scores. Supervised Meta-blocking goes beyond this approach by combining multiple scores per comparison into a feature vector that is fed to a binary classifier. By using probabilistic classifiers, Generalized Supervised Meta-blocking associates every pair of candidates with a score that can be used: (i) by any pruning algorithm for retaining the set of candidate comparisons; and (ii) by state-of-the-art progressive ER methods to identify the most promising candidates as early as possible (when time is a critical component for the downstream applications that consume the data). For higher effectiveness, new weighting schemes are examined as features. Through an extensive experimental analysis, we identify the best pruning algorithms, their optimal sets of features as well as the minimum possible size of the training set. The resulting approaches achieve excellent performance across several established benchmark datasets.
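The pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy records, the use of token blocking as the schema-agnostic blocking step, the two features (number of shared blocks and Jaccard similarity of the block sets), the hand-written logistic regression standing in for a probabilistic classifier, and the mean-score threshold standing in for a pruning algorithm.

```python
import math
from collections import defaultdict
from itertools import combinations

def token_blocking(records):
    """Schema-agnostic blocking (assumed variant): one block per token."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in set(text.lower().split()):
            blocks[token].add(rid)
    # A block holding a single entity produces no comparisons.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}

def pair_features(blocks):
    """Feature vector per candidate pair, derived from block co-occurrence."""
    entity_blocks = defaultdict(set)
    for token, ids in blocks.items():
        for rid in ids:
            entity_blocks[rid].add(token)
    features = {}
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) not in features:
                common = len(entity_blocks[a] & entity_blocks[b])
                union = len(entity_blocks[a] | entity_blocks[b])
                features[(a, b)] = [float(common), common / union]
    return features

def predict_proba(w, x):
    """Matching probability under a logistic model; the bias is the last weight."""
    z = sum(wj * xj for wj, xj in zip(w, x + [1.0]))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=500, lr=0.3):
    """Tiny logistic regression trained with batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            err = predict_proba(w, x) - label
            for j, xj in enumerate(x + [1.0]):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def prune_below_mean(scores):
    """Hypothetical pruning rule: keep pairs scoring at least the mean."""
    mean = sum(scores.values()) / len(scores)
    return {pair for pair, s in scores.items() if s >= mean}

# Toy data (hypothetical).
records = {
    "a1": "Apple iPhone 12 black 64GB",
    "a2": "iphone 12 apple 64gb black smartphone",
    "b1": "Samsung Galaxy S21 black",
    "b2": "galaxy s21 samsung phantom black",
    "c1": "Black leather wallet",
}
features = pair_features(token_blocking(records))

# A small labelled sample of candidate pairs (1 = match, 0 = non-match).
labelled = {("a1", "a2"): 1, ("a1", "b1"): 0, ("a1", "c1"): 0, ("b1", "c1"): 0}
weights = train_logistic([features[p] for p in labelled], list(labelled.values()))

# One probability per candidate pair, usable for pruning or for ranking.
scores = {pair: predict_proba(weights, x) for pair, x in features.items()}
retained = prune_below_mean(scores)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here `retained` plays the role of the pruned candidate set, while `ranked` orders the same pairs by descending probability, which is the form a progressive ER method would consume when the most promising comparisons must be executed first.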


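Source journal

Information Systems (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 9.40
Self-citation rate: 2.70%
Articles published: 112
Review time: 53 days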

About the journal: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems, such as new data models and performance enhancements, and show how those innovations contribute to the goals of the application.