The Four Generations of Entity Resolution

G. Papadakis, Ekaterini Ioannou, Emanouil Thanos, Themis Palpanas
Journal: Synthesis Lectures on Data Management
DOI: 10.2200/S01067ED1V01Y202012DTM064
Citations: 44

Abstract

Entity Resolution (ER) lies at the core of data integration and cleaning, and thus the bulk of the research examines ways of improving its effectiveness and time efficiency. The initial ER methods primarily target Veracity in the context of structured (relational) data that are described by a schema of well-known quality and meaning. To achieve high effectiveness, they leverage schema, expert, and/or external knowledge. Some of these methods have been extended to address Volume, processing large datasets through multi-core or massive parallelization approaches, such as the MapReduce paradigm. However, these early schema-based approaches are inapplicable to Web Data, which abound in voluminous, noisy, semi-structured, and highly heterogeneous information. To address the additional challenge of Variety, recent works on ER adopt a novel, loosely schema-aware functionality that emphasizes scalability and robustness to noise. Another line of present research focuses on the additional challenge of Velocity, aiming to process data collections of a continuously increasing volume. The latest works, though, take advantage of the significant breakthroughs in Deep Learning and Crowdsourcing, incorporating external knowledge to enhance the existing works to a significant extent. This synthesis lecture organizes ER methods into four generations based on the challenges posed by these four Vs. For each generation, we outline the corresponding ER workflow, discuss the state-of-the-art methods per workflow step, and present current research directions. The discussion of these methods takes into account a historical perspective, explaining the evolution of the methods over time along with their similarities and differences. The lecture also discusses the available ER tools and benchmark datasets that allow expert as well as novice users to make use of the available solutions.
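To make the abstract's contrast between schema-based and schema-agnostic ER concrete, the following is a minimal illustrative sketch (not taken from the lecture): a toy pipeline in the spirit of the "Variety" generation, using schema-agnostic token blocking to generate candidate pairs and Jaccard similarity for matching. All record layouts, function names, and the 0.5 threshold are invented for demonstration.

```python
from collections import defaultdict
from itertools import combinations

def tokens(record):
    """Collect lowercase tokens from all attribute values, ignoring attribute names."""
    return {tok for value in record.values() for tok in str(value).lower().split()}

def token_blocking(records):
    """Group record ids into blocks keyed by the tokens they contain."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for tok in tokens(rec):
            blocks[tok].add(rid)
    return blocks

def candidate_pairs(blocks):
    """Every pair of records co-occurring in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def jaccard(a, b):
    return len(a & b) / len(a | b)

def resolve(records, threshold=0.5):
    """Return candidate pairs whose token sets exceed the similarity threshold."""
    blocks = token_blocking(records)
    return {(a, b) for a, b in candidate_pairs(blocks)
            if jaccard(tokens(records[a]), tokens(records[b])) >= threshold}

records = {
    "r1": {"name": "John Smith", "city": "Athens"},
    "r2": {"fullName": "john smith", "location": "athens"},  # different schema
    "r3": {"name": "Jane Doe", "city": "Paris"},
}
print(resolve(records))  # r1 and r2 match despite the schema heterogeneity
```

Because blocking keys are derived from values rather than attribute names, "r1" and "r2" land in the same blocks even though their schemas differ; this is the loosely schema-aware functionality the abstract attributes to third-generation methods.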