{"title":"主动单调分类的实体匹配","authors":"Yufei Tao","doi":"10.1145/3196959.3196984","DOIUrl":null,"url":null,"abstract":"Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ın X x Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the correct verdict could not be determined without investigation efforts dedicated specifically to the two entities x and y involved. It is therefore important to design an algorithm that asks humans to look at only some pairs, and renders the verdicts on the other pairs automatically with good accuracy. At the core of most (if not all) existing approaches is the following classification problem. The input is a set P of points in Rd, each of which carries a binary label: 0 or 1. A classifier F is a function from Rd to (0, 1). The objective is to find a classifier that captures the labels of a large number of points in P. In this paper, we cast the problem as an instance of active learning where the goal is to learn a monotone classifier F, namely, F(p) ≥ F(q) holds whenever the coordinate of p is at least that of q on all dimensions. In our formulation, the labels of all points in P are hidden at the beginning. An algorithm A can invoke an oracle, which discloses the label of a point p ın P chosen by A. The algorithm may do so repetitively, until it has garnered enough information to produce F. The cost of A is the number of times that the oracle is called. The challenge is to strike a good balance between the cost and the accuracy of the classifier produced. We describe algorithms with non-trivial guarantees on the cost and accuracy simultaneously. We also prove lower bounds that establish the asymptotic optimality of our solutions for a wide range of parameters.","PeriodicalId":344370,"journal":{"name":"Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"101 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Entity Matching with Active Monotone Classification\",\"authors\":\"Yufei Tao\",\"doi\":\"10.1145/3196959.3196984\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ın X x Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the correct verdict could not be determined without investigation efforts dedicated specifically to the two entities x and y involved. It is therefore important to design an algorithm that asks humans to look at only some pairs, and renders the verdicts on the other pairs automatically with good accuracy. At the core of most (if not all) existing approaches is the following classification problem. The input is a set P of points in Rd, each of which carries a binary label: 0 or 1. A classifier F is a function from Rd to (0, 1). The objective is to find a classifier that captures the labels of a large number of points in P. In this paper, we cast the problem as an instance of active learning where the goal is to learn a monotone classifier F, namely, F(p) ≥ F(q) holds whenever the coordinate of p is at least that of q on all dimensions. In our formulation, the labels of all points in P are hidden at the beginning. An algorithm A can invoke an oracle, which discloses the label of a point p ın P chosen by A. The algorithm may do so repetitively, until it has garnered enough information to produce F. The cost of A is the number of times that the oracle is called. The challenge is to strike a good balance between the cost and the accuracy of the classifier produced. We describe algorithms with non-trivial guarantees on the cost and accuracy simultaneously. We also prove lower bounds that establish the asymptotic optimality of our solutions for a wide range of parameters.\",\"PeriodicalId\":344370,\"journal\":{\"name\":\"Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems\",\"volume\":\"101 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3196959.3196984\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3196959.3196984","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11
摘要
给定两组实体X和Y,实体匹配的目的是确定X和Y是否代表每对(X, Y) ın X X Y.作为最后的手段,可以要求人类专家检查每个(X, Y),但这是昂贵的,因为如果没有专门针对所涉及的两个实体X和Y的调查工作,就无法确定正确的判决。因此,设计一种算法是很重要的,它要求人们只看一些配对,并以良好的准确性自动对其他配对做出判断。大多数(如果不是全部)现有方法的核心是以下分类问题。输入是Rd中P个点的集合,每个点都带有一个二进制标号:0或1。分类器F是一个从Rd到(0,1)的函数。目标是找到一个能够捕获p中大量点的标签的分类器。在本文中,我们将这个问题作为一个主动学习的实例,其目标是学习一个单调分类器F,即F(p)≥F(q),只要p的坐标在所有维度上至少是q的坐标。在我们的公式中,P中所有点的标签一开始都是隐藏的。算法A可以调用一个oracle,它公开A选择的点p ın p的标签。算法可以重复这样做,直到它收集到足够的信息产生f。A的代价是调用oracle的次数。面临的挑战是在成本和分类器的准确性之间取得良好的平衡。我们描述了同时对成本和准确性有重要保证的算法。我们还证明了在大范围参数下解的渐近最优性的下界。
Entity Matching with Active Monotone Classification
Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ın X x Y. As the last resort, human experts can be called upon to inspect every (x, y), but this is expensive because the correct verdict could not be determined without investigation efforts dedicated specifically to the two entities x and y involved. It is therefore important to design an algorithm that asks humans to look at only some pairs, and renders the verdicts on the other pairs automatically with good accuracy. At the core of most (if not all) existing approaches is the following classification problem. The input is a set P of points in Rd, each of which carries a binary label: 0 or 1. A classifier F is a function from Rd to (0, 1). The objective is to find a classifier that captures the labels of a large number of points in P. In this paper, we cast the problem as an instance of active learning where the goal is to learn a monotone classifier F, namely, F(p) ≥ F(q) holds whenever the coordinate of p is at least that of q on all dimensions. In our formulation, the labels of all points in P are hidden at the beginning. An algorithm A can invoke an oracle, which discloses the label of a point p ın P chosen by A. The algorithm may do so repetitively, until it has garnered enough information to produce F. The cost of A is the number of times that the oracle is called. The challenge is to strike a good balance between the cost and the accuracy of the classifier produced. We describe algorithms with non-trivial guarantees on the cost and accuracy simultaneously. We also prove lower bounds that establish the asymptotic optimality of our solutions for a wide range of parameters.