{"title":"单调分类的新算法","authors":"Yufei Tao, Yu Wang","doi":"10.1145/3452021.3458324","DOIUrl":null,"url":null,"abstract":"In \\em monotone classification, the input is a set P of n points in d-dimensional space, where each point carries a label 0 or 1. A point p \\em dominates another point q if the coordinate of p is at least that of q on every dimension. A \\em monotone classifier is a function h mapping each d-dimensional point to $\\0, 1\\ $, subject to the condition that $h(p) \\ge h(q)$ holds whenever p dominates q. The classifier h \\em mis-classifies a point $p \\in P$ if $h(p)$ is different from the label of p. The \\em error of h is the number of points in P mis-classified by h. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem. In the first \\em active version, all the labels are hidden in the beginning; an algorithm must pay a unit cost to \\em probe (i.e., reveal) the label of a point in P. We prove that $Ømega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d=1$). On the other hand, given an arbitrary $\\eps > 0$, we show how to obtain (with high probability) a monotone classifier whose error is worse than the optimum by at most a $1 + \\eps$ factor, while probing $\\tO(w/\\eps^2)$ labels, where w is the dominance width of P and $\\tO(.)$ hides a polylogarithmic factor. For constant $\\eps$, the probing cost matches an existing lower bound up to an $\\tO(1)$ factor. In the second \\em passive version, the point labels in P are explicitly given; the goal is to minimize CPU computation in finding an optimal classifier. We show that the problem can be settled in time polynomial to both d and n.","PeriodicalId":405398,"journal":{"name":"Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"New Algorithms for Monotone Classification\",\"authors\":\"Yufei Tao, Yu Wang\",\"doi\":\"10.1145/3452021.3458324\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In \\\\em monotone classification, the input is a set P of n points in d-dimensional space, where each point carries a label 0 or 1. A point p \\\\em dominates another point q if the coordinate of p is at least that of q on every dimension. A \\\\em monotone classifier is a function h mapping each d-dimensional point to $\\\\0, 1\\\\ $, subject to the condition that $h(p) \\\\ge h(q)$ holds whenever p dominates q. The classifier h \\\\em mis-classifies a point $p \\\\in P$ if $h(p)$ is different from the label of p. The \\\\em error of h is the number of points in P mis-classified by h. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem. In the first \\\\em active version, all the labels are hidden in the beginning; an algorithm must pay a unit cost to \\\\em probe (i.e., reveal) the label of a point in P. We prove that $Ømega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d=1$). 
In \emph{monotone classification}, the input is a set $P$ of $n$ points in $d$-dimensional space, where each point carries a label 0 or 1. A point $p$ \emph{dominates} another point $q$ if the coordinate of $p$ is at least that of $q$ on every dimension. A \emph{monotone classifier} is a function $h$ mapping each $d$-dimensional point to $\{0, 1\}$, subject to the condition that $h(p) \ge h(q)$ holds whenever $p$ dominates $q$. The classifier $h$ \emph{mis-classifies} a point $p \in P$ if $h(p)$ differs from the label of $p$. The \emph{error} of $h$ is the number of points in $P$ mis-classified by $h$. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem. In the first, \emph{active}, version, all the labels are hidden at the beginning; an algorithm must pay a unit cost to \emph{probe} (i.e., reveal) the label of a point in $P$. We prove that $\Omega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d = 1$). On the other hand, given an arbitrary $\epsilon > 0$, we show how to obtain (with high probability) a monotone classifier whose error is worse than the optimum by at most a $1 + \epsilon$ factor, while probing $\widetilde{O}(w/\epsilon^2)$ labels, where $w$ is the dominance width of $P$ and $\widetilde{O}(\cdot)$ hides a polylogarithmic factor. For constant $\epsilon$, the probing cost matches an existing lower bound up to an $\widetilde{O}(1)$ factor. In the second, \emph{passive}, version, the point labels in $P$ are explicitly given; the goal is to minimize the CPU computation needed to find an optimal classifier. We show that the problem can be settled in time polynomial in both $d$ and $n$.
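To make the definitions concrete, the following Python sketch (not part of the paper; all names and the sample data are illustrative assumptions) implements the dominance relation, the error of a classifier, a brute-force monotonicity check, and a simple baseline for the passive version when $d = 1$, where every monotone classifier reduces to a threshold rule.

    # Minimal sketch (not from the paper): dominance, classifier error, a
    # monotonicity check, and an optimal 1-D classifier via threshold sweep.
    import itertools
    from typing import Callable, List, Tuple

    Point = Tuple[float, ...]
    Labeled = Tuple[Point, int]

    def dominates(p: Point, q: Point) -> bool:
        # p dominates q if p's coordinate is at least q's on every dimension.
        return all(pi >= qi for pi, qi in zip(p, q))

    def error(h: Callable[[Point], int], data: List[Labeled]) -> int:
        # Number of labeled points mis-classified by h.
        return sum(1 for p, label in data if h(p) != label)

    def is_monotone(h: Callable[[Point], int], points: List[Point]) -> bool:
        # h(p) >= h(q) must hold whenever p dominates q (checked over all pairs).
        return all(h(p) >= h(q) for p in points for q in points if dominates(p, q))

    def best_threshold_1d(data: List[Tuple[float, int]]) -> Tuple[float, int]:
        # For d = 1, a monotone classifier is a threshold rule h_t(x) = 1 iff
        # x > t, so an optimal classifier follows from sweeping candidate
        # thresholds in sorted order while maintaining a running error count.
        pts = sorted(data)
        err = sum(1 for _, lbl in pts if lbl == 0)  # t = -inf: all points get label 1
        best_t, best_err = float("-inf"), err
        for x, group in itertools.groupby(pts, key=lambda pl: pl[0]):
            for _, lbl in group:
                err += 1 if lbl == 1 else -1        # x now falls below the threshold
            if err < best_err:
                best_t, best_err = x, err
        return best_t, best_err

    # Example: these labels are consistent with a threshold at 1.0, so the error is 0.
    data_1d = [(0.5, 0), (1.0, 0), (1.5, 1), (2.0, 1)]
    t, e = best_threshold_1d(data_1d)  # t = 1.0, e = 0

Note that this baseline runs in $O(n \log n)$ time and only covers $d = 1$; the paper's contribution is an approximation algorithm with $\widetilde{O}(w/\epsilon^2)$ probes for the active version and a polynomial-time exact algorithm for the passive version in general dimension $d$.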