基于低维空间线性分类的大规模并行实体匹配

Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory Pub Date : 2018-01-01 DOI:10.4230/LIPIcs.ICDT.2018.20

Yufei Tao

{"title":"基于低维空间线性分类的大规模并行实体匹配","authors":"Yufei Tao","doi":"10.4230/LIPIcs.ICDT.2018.20","DOIUrl":null,"url":null,"abstract":"In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match. Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R times S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R times S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R|+|S|+|T|, thereby achieving a large performance gain in the (typical) scenario where |T|<<|R||S|. This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.","PeriodicalId":90482,"journal":{"name":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space\",\"authors\":\"Yufei Tao\",\"doi\":\"10.4230/LIPIcs.ICDT.2018.20\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match. Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R times S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R times S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R|+|S|+|T|, thereby achieving a large performance gain in the (typical) scenario where |T|<<|R||S|. This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.\",\"PeriodicalId\":90482,\"journal\":{\"name\":\"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4230/LIPIcs.ICDT.2018.20\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4230/LIPIcs.ICDT.2018.20","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 9

摘要

在实体匹配的分类,我们给出两套R和S的对象是否R和S形式以每一对匹配(R, S)在x R S .如果R和S是域的子集(R)和D (S)分别的目标是发现一个分类器函数f: D (R) x D (S) - >{0,1}从某个类的属性,每一个在R (R, S) x年代,f (R, S) = 1当且仅当R和S是匹配。过去的研究习惯于直接在R乘以s的所有标记(即匹配或不匹配)对上运行学习算法，然而，这存在一个缺点，即即使读取输入也会产生二次成本。我们追求一个消除二次势垒的方向。用T表示R乘以S的匹配对的集合。我们建议接受R、S、T作为输入，以|R|+|S|+|T|为代价来解决问题，从而在|T|<<|R||S|的(典型)场景中获得较大的性能提升。本文通过展示如何用线性分类实现实体匹配的上述目的，为新方向的可行性提供了证据，其中分类器是分离匹配对和不匹配对的线性多维平面。我们实际上是在MPC模型中这样做的，这与为大规模学习部署大规模并行计算系统的趋势相呼应。作为副产物，我们得到了三个几何问题的新的MPC算法:线性规划、批处理范围计数和优势连接。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match. Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R times S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R times S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R|+|S|+|T|, thereby achieving a large performance gain in the (typical) scenario where |T|<<|R||S|. This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Database theory-- ICDT : International Conference ... proceedings. International Conference on Database Theory

自引率

0.00%

发文量