Open benchmark for filtering techniques in entity resolution

Franziska Neuhof, Marco Fisichella, George Papadakis, Konstantinos Nikoletos, Nikolaus Augsten, Wolfgang Nejdl, Manolis Koubarakis
The VLDB Journal, published 2024-07-09. DOI: 10.1007/s00778-024-00868-7 (https://doi.org/10.1007/s00778-024-00868-7)

Abstract

Entity Resolution identifies entity profiles that represent the same real-world object. A brute-force approach that considers all pairs of entities suffers from quadratic time complexity. To mitigate this, filtering techniques restrict the search space to highly similar and, thus, highly likely matches. Such techniques come in two forms: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor workflows convert all entity profiles into vectors and detect the ones closest to every query entity. The main techniques of these two types have never been compared systematically, so their relative performance is unknown. To fill this gap, we perform an extensive experimental study of the main representatives of each type over numerous established datasets. Comparing techniques of different types fairly is non-trivial, because the configuration parameters of each approach have a significant impact on its performance but are hard to fine-tune. We consider a wide range of parameter configurations per method, optimizing each workflow with respect to recall and precision in both schema-agnostic and schema-aware settings. The experimental results provide novel insights into the effectiveness, time efficiency, memory footprint, and scalability of the considered techniques.
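
As a rough illustration of the two filter types named in the abstract, the following minimal Python sketch contrasts a token-based blocking step with a top-k nearest-neighbor step and scores both by recall and precision. It is not the authors' implementation: the toy profiles, the function names, the whitespace tokenization, and the choice of cosine similarity over token counts are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt

def tokens(profile):
    # Assumed signature/vector basis: lowercase whitespace-separated tokens.
    return profile.lower().split()

def token_blocking(left, right):
    """Blocking: pair profiles that share at least one token (signature)."""
    blocks = defaultdict(lambda: (set(), set()))
    for i, p in enumerate(left):
        for t in tokens(p):
            blocks[t][0].add(i)
    for j, p in enumerate(right):
        for t in tokens(p):
            blocks[t][1].add(j)
    # Every cross-collection pair inside a block becomes a candidate.
    return {(i, j) for ls, rs in blocks.values() for i in ls for j in rs}

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_filtering(left, right, k=1):
    """Nearest neighbor: vectorize profiles, keep the k closest per query."""
    index = [Counter(tokens(p)) for p in left]
    candidates = set()
    for j, q in enumerate(right):
        qv = Counter(tokens(q))
        ranked = sorted(range(len(index)),
                        key=lambda i: cosine(index[i], qv), reverse=True)
        candidates.update((i, j) for i in ranked[:k])
    return candidates

def recall_precision(candidates, truth):
    # Recall: share of true matches retained; precision: share of
    # candidates that are true matches.
    hits = len(candidates & truth)
    return hits / len(truth), hits / len(candidates)

# Hypothetical toy data: two collections and their known matches.
left = ["Apple iPhone 13 128GB", "Samsung Galaxy S21", "Sony WH-1000XM4 headphones"]
right = ["iphone 13 apple", "galaxy s21 samsung 5g", "apple airpods pro"]
truth = {(0, 0), (1, 1)}

for name, cands in [("blocking", token_blocking(left, right)),
                    ("kNN", knn_filtering(left, right, k=1))]:
    print(name, sorted(cands), recall_precision(cands, truth))
```

Both filters compare far fewer than the nine possible pairs here, and the parameter `k` of the nearest-neighbor step is exactly the kind of configuration choice whose tuning the abstract describes as difficult.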

