{"title":"SMART-TV:一个快速和可扩展的基于最近邻的数据挖掘分类器","authors":"T. Abidin, W. Perrizo","doi":"10.1145/1141277.1141403","DOIUrl":null,"url":null,"abstract":"K-nearest neighbors (KNN) is the simplest method for classification. Given a set of objects in a multi-dimensional feature space, the method assigns a category to an unclassified object based on the plurality of category of the k-nearest neighbors. The closeness between objects is determined using a distance measure, e.g. Euclidian distance. Despite its simplicity, KNN also has some drawbacks: 1) it suffers from expensive computational cost in training when the training set contains millions of objects; 2) its classification time is linear to the size of the training set. The larger the training set, the longer it takes to search for the k-nearest neighbors. In this paper, we propose a new algorithm, called SMART-TV (Small Absolute difference of Total Variation), that approximates a set of potential candidates of nearest neighbors by examining the absolute difference of total variation between each data object in the training set and the unclassified object. Then, the k-nearest neighbors are searched from that candidate set. We empirically evaluate the performance of our algorithm on both real and synthetic datasets and find that SMART-TV is fast and scalable. The classification accuracy of SMART-TV is high and comparable to the accuracy of the traditional KNN algorithm.","PeriodicalId":269830,"journal":{"name":"Proceedings of the 2006 ACM symposium on Applied computing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":"{\"title\":\"SMART-TV: a fast and scalable nearest neighbor based classifier for data mining\",\"authors\":\"T. Abidin, W. Perrizo\",\"doi\":\"10.1145/1141277.1141403\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"K-nearest neighbors (KNN) is the simplest method for classification. Given a set of objects in a multi-dimensional feature space, the method assigns a category to an unclassified object based on the plurality of category of the k-nearest neighbors. The closeness between objects is determined using a distance measure, e.g. Euclidian distance. Despite its simplicity, KNN also has some drawbacks: 1) it suffers from expensive computational cost in training when the training set contains millions of objects; 2) its classification time is linear to the size of the training set. The larger the training set, the longer it takes to search for the k-nearest neighbors. In this paper, we propose a new algorithm, called SMART-TV (Small Absolute difference of Total Variation), that approximates a set of potential candidates of nearest neighbors by examining the absolute difference of total variation between each data object in the training set and the unclassified object. Then, the k-nearest neighbors are searched from that candidate set. We empirically evaluate the performance of our algorithm on both real and synthetic datasets and find that SMART-TV is fast and scalable. The classification accuracy of SMART-TV is high and comparable to the accuracy of the traditional KNN algorithm.\",\"PeriodicalId\":269830,\"journal\":{\"name\":\"Proceedings of the 2006 ACM symposium on Applied computing\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-04-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"36\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2006 ACM symposium on Applied computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1141277.1141403\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2006 ACM symposium on Applied computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1141277.1141403","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 36
摘要
k近邻(KNN)是最简单的分类方法。该方法在多维特征空间中给定一组对象,根据其k近邻的类别个数为未分类对象分配类别。物体之间的距离是用距离度量来确定的,例如欧几里得距离。尽管简单,但KNN也有一些缺点:1)当训练集包含数百万个对象时,它的训练计算成本昂贵;2)其分类时间与训练集的大小成线性关系。训练集越大,搜索k个最近邻所需的时间就越长。在本文中,我们提出了一种新的算法,称为SMART-TV (Small Absolute difference of Total Variation),它通过检查训练集中每个数据对象与未分类对象之间的总变化的绝对差来近似一组最近邻的潜在候选对象。然后,从候选集合中搜索k个最近的邻居。我们对算法在真实数据集和合成数据集上的性能进行了实证评估,发现SMART-TV快速且可扩展。SMART-TV的分类精度高,可与传统KNN算法的分类精度相媲美。
SMART-TV: a fast and scalable nearest neighbor based classifier for data mining
K-nearest neighbors (KNN) is the simplest method for classification. Given a set of objects in a multi-dimensional feature space, the method assigns a category to an unclassified object based on the plurality of category of the k-nearest neighbors. The closeness between objects is determined using a distance measure, e.g. Euclidian distance. Despite its simplicity, KNN also has some drawbacks: 1) it suffers from expensive computational cost in training when the training set contains millions of objects; 2) its classification time is linear to the size of the training set. The larger the training set, the longer it takes to search for the k-nearest neighbors. In this paper, we propose a new algorithm, called SMART-TV (Small Absolute difference of Total Variation), that approximates a set of potential candidates of nearest neighbors by examining the absolute difference of total variation between each data object in the training set and the unclassified object. Then, the k-nearest neighbors are searched from that candidate set. We empirically evaluate the performance of our algorithm on both real and synthetic datasets and find that SMART-TV is fast and scalable. The classification accuracy of SMART-TV is high and comparable to the accuracy of the traditional KNN algorithm.