New Algorithms for Monotone Classification

Yufei Tao, Yu Wang
DOI: 10.1145/3452021.3458324
Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2021), published June 20, 2021.
Citations: 1

Abstract

In monotone classification, the input is a set P of n points in d-dimensional space, where each point carries a label 0 or 1. A point p dominates another point q if the coordinate of p is at least that of q on every dimension. A monotone classifier is a function h mapping each d-dimensional point to $\{0, 1\}$, subject to the condition that $h(p) \ge h(q)$ holds whenever p dominates q. The classifier h mis-classifies a point $p \in P$ if $h(p)$ differs from the label of p. The error of h is the number of points in P mis-classified by h. The objective is to find a monotone classifier with a small error. The problem is fundamental to numerous database applications in entity matching, record linkage, and duplicate detection. This paper studies two variants of the problem. In the first, active version, all the labels are hidden in the beginning; an algorithm must pay a unit cost to probe (i.e., reveal) the label of a point in P. We prove that $\Omega(n)$ probes are necessary to find an optimal classifier even in one-dimensional space ($d = 1$). On the other hand, given an arbitrary $\epsilon > 0$, we show how to obtain (with high probability) a monotone classifier whose error is worse than the optimum by at most a $1 + \epsilon$ factor, while probing $\tilde{O}(w/\epsilon^2)$ labels, where w is the dominance width of P and $\tilde{O}(\cdot)$ hides a polylogarithmic factor. For constant $\epsilon$, the probing cost matches an existing lower bound up to an $\tilde{O}(1)$ factor. In the second, passive version, the point labels in P are explicitly given; the goal is to minimize CPU computation in finding an optimal classifier. We show that the problem can be settled in time polynomial in both d and n.
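To make the problem concrete, the following is a minimal sketch (not the paper's algorithm) of the passive variant in one dimension ($d = 1$). In 1D, dominance reduces to the ordinary order on coordinates, so every monotone classifier is a threshold rule: points at or above some threshold map to 1, the rest to 0. The optimal error can then be found by sorting and scanning all n + 1 candidate split positions; the function name and interface below are illustrative assumptions.

```python
def optimal_1d_error(points):
    """points: list of (coordinate, label) pairs with label in {0, 1}.

    Returns the minimum number of mis-classified points over all
    monotone (i.e., threshold) classifiers in one dimension.
    Runs in O(n log n) time due to the sort; the scan itself is O(n).
    """
    pts = sorted(points)  # order by coordinate; ties are harmless here
    zeros_total = sum(1 for _, lbl in pts if lbl == 0)

    # Split position k: the first k points are classified 0, the rest 1.
    # k = 0 classifies everything as 1, so the error is the count of 0-labels.
    best = zeros_total
    ones_left = 0   # 1-labeled points among the first k (each mis-classified)
    zeros_left = 0  # 0-labeled points among the first k (each correct)

    for k in range(1, len(pts) + 1):
        _, lbl = pts[k - 1]
        if lbl == 1:
            ones_left += 1
        else:
            zeros_left += 1
        # Errors = 1-labels classified 0 + 0-labels classified 1.
        err = ones_left + (zeros_total - zeros_left)
        best = min(best, err)
    return best
```

For example, a perfectly monotone input such as `[(1, 0), (2, 0), (3, 1), (4, 1)]` admits error 0, while `[(1, 1), (2, 0), (3, 1)]` forces at least one mis-classification under any threshold. In higher dimensions the structure is richer, which is why the polynomial-time result for general d is nontrivial.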