Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching

Proc. VLDB Endow. Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583163

Derek Paulsen, Yash Govind, A. Doan


Citations: 3

Abstract

Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
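To make the idea of top-k tf/idf blocking concrete, here is a minimal single-machine sketch in plain Python. It is not Sparkly's implementation: the abstract says Sparkly uses Lucene on a Spark cluster, while this sketch uses a hand-rolled inverted index, whitespace tokenization, and an unnormalized tf/idf score, all of which are assumptions made for illustration. The function names (`tokenize`, `build_index`, `top_k_block`) and the toy records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    return text.lower().split()


def build_index(records):
    """Build an inverted index (token -> {record_id: term frequency}) and idf weights."""
    index = defaultdict(dict)
    for rid, text in records.items():
        for tok, tf in Counter(tokenize(text)).items():
            index[tok][rid] = tf
    n = len(records)
    # One common smoothed idf variant; other formulas are equally valid.
    idf = {tok: math.log(1 + n / len(postings)) for tok, postings in index.items()}
    return index, idf


def top_k_block(query, index, idf, k):
    """Return the ids of the k indexed records with the highest tf/idf score for query."""
    scores = defaultdict(float)
    for tok, qtf in Counter(tokenize(query)).items():
        for rid, tf in index.get(tok, {}).items():
            # Unnormalized dot product of the two tf/idf vectors.
            scores[rid] += (qtf * idf[tok]) * (tf * idf[tok])
    # Sort by descending score, breaking ties by record id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [rid for rid, _ in ranked[:k]]


# Hypothetical toy tables: index table A, then probe it with each record of table B.
table_a = {
    "a1": "apple iphone 13 pro 128gb",
    "a2": "samsung galaxy s21 ultra",
    "a3": "apple macbook pro 14 inch",
}
table_b = {
    "b1": "iphone 13 pro apple 128 gb",
    "b2": "galaxy s21 ultra by samsung",
}

index, idf = build_index(table_a)
candidates = {bid: top_k_block(text, index, idf, k=2) for bid, text in table_b.items()}
print(candidates)
```

Each record in table B keeps at most k candidate matches from table A, so the output size is bounded by k times the size of table B regardless of how many records share tokens. Because each probe reads only the index and shares no state with other probes, the probes can be partitioned across workers, which is the property the share-nothing design in the abstract relies on.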

