{"title":"在mapreduce框架中使用位置敏感散列改进KNN算法","authors":"S. Bagui, A. Mondal, S. Bagui","doi":"10.1145/3190645.3190700","DOIUrl":null,"url":null,"abstract":"The K-Nearest Neighbor! (KNN) algorithm is one of the most widely used algorithms in data mining for classification and prediction. The algorithm has several applications: in facial detection when used with deep learning, in biometric security applications etc. The traditional KNN algorithm involves an iterative process of computing the distance between a test data point and every data point in the training dataset, and classifying the object based on the closest training sample. This method first selects K nearest training data points for classifying a test data point and then predicts the test sample's class based on the majority class among those neighbors. If both the train and test datasets are large, this conventional form can be considered computationally expensive. Reduction of the massive calculation that is required to predict a data vector was our main goal, and with this intention, the training dataset was split into several buckets. The KNN algorithm was then performed inside a bucket, instead of iterating over the whole training dataset. We used the Jaccard Coefficient to determine the degree of similarity of a data vector with some arbitrarily defined data points P and placed similar data points in the same bucket. This was the core functionality of our hash function. The hash function determines the bucket number where the similar data vectors will be placed. Unlike the standard hashing algorithm, our approach of hashing was to maximize the probability of the hash collision to preserve the locality sensitiveness. Both the conventional and proposed methods were implemented in Hadoop's MapReduce framework. Hadoop gives us an architecture for handling large datasets on a computer cluster in a distributed manner and gives us massive scalability. The use of the locality sensitive hashing in KNN in Hadoop's MapReduce environment took less time than conventional KNN to classify a new data object.","PeriodicalId":403177,"journal":{"name":"Proceedings of the ACMSE 2018 Conference","volume":"29 11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Using locality sensitive hashing to improve the KNN algorithm in the mapreduce framework\",\"authors\":\"S. Bagui, A. Mondal, S. Bagui\",\"doi\":\"10.1145/3190645.3190700\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The K-Nearest Neighbor! (KNN) algorithm is one of the most widely used algorithms in data mining for classification and prediction. The algorithm has several applications: in facial detection when used with deep learning, in biometric security applications etc. The traditional KNN algorithm involves an iterative process of computing the distance between a test data point and every data point in the training dataset, and classifying the object based on the closest training sample. This method first selects K nearest training data points for classifying a test data point and then predicts the test sample's class based on the majority class among those neighbors. If both the train and test datasets are large, this conventional form can be considered computationally expensive. 
Reduction of the massive calculation that is required to predict a data vector was our main goal, and with this intention, the training dataset was split into several buckets. The KNN algorithm was then performed inside a bucket, instead of iterating over the whole training dataset. We used the Jaccard Coefficient to determine the degree of similarity of a data vector with some arbitrarily defined data points P and placed similar data points in the same bucket. This was the core functionality of our hash function. The hash function determines the bucket number where the similar data vectors will be placed. Unlike the standard hashing algorithm, our approach of hashing was to maximize the probability of the hash collision to preserve the locality sensitiveness. Both the conventional and proposed methods were implemented in Hadoop's MapReduce framework. Hadoop gives us an architecture for handling large datasets on a computer cluster in a distributed manner and gives us massive scalability. The use of the locality sensitive hashing in KNN in Hadoop's MapReduce environment took less time than conventional KNN to classify a new data object.\",\"PeriodicalId\":403177,\"journal\":{\"name\":\"Proceedings of the ACMSE 2018 Conference\",\"volume\":\"29 11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACMSE 2018 Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3190645.3190700\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACMSE 2018 Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3190645.3190700","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Using locality sensitive hashing to improve the KNN algorithm in the mapreduce framework
The K-Nearest Neighbor (KNN) algorithm is one of the most widely used algorithms in data mining for classification and prediction. It has several applications: facial detection when combined with deep learning, biometric security, etc. The traditional KNN algorithm iterates over the entire training dataset, computing the distance between the test data point and every training data point; it then selects the K nearest training data points and predicts the test sample's class as the majority class among those neighbors. When both the training and test datasets are large, this conventional approach is computationally expensive. Our main goal was to reduce the massive computation required to classify a data vector, and with this intention the training dataset was split into several buckets. The KNN algorithm was then performed inside a single bucket, instead of iterating over the whole training dataset. We used the Jaccard coefficient to measure the similarity of a data vector to a set of arbitrarily defined reference points P, and placed similar data vectors in the same bucket. This is the core functionality of our hash function: it determines the bucket number in which similar data vectors are placed. Unlike a standard hashing algorithm, our approach maximizes the probability of hash collisions in order to preserve locality sensitivity. Both the conventional and the proposed methods were implemented in Hadoop's MapReduce framework. Hadoop provides an architecture for handling large datasets on a computer cluster in a distributed manner and offers massive scalability. Using locality sensitive hashing with KNN in Hadoop's MapReduce environment took less time than conventional KNN to classify a new data object.
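
To make the baseline concrete, below is a minimal sketch of the conventional KNN classification described above. The function names and the choice of Euclidean distance are illustrative assumptions; the abstract does not fix a particular distance metric for the KNN step.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two numeric feature vectors
    # (an assumed metric; the paper does not specify one).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, labels, test_point, k):
    # Conventional KNN: compute the distance from the test point to
    # every training point, keep the k closest, and return the
    # majority class among those neighbors.
    order = sorted(range(len(train)),
                   key=lambda i: euclidean(train[i], test_point))
    nearest_labels = [labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```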
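The bucketing step can be sketched as follows. Treating each data vector as a set of values for the Jaccard computation, and hashing to the most similar reference point, are simplifying assumptions made for illustration; the key idea from the paper is that the hash deliberately maximizes collisions among similar vectors.

```python
def jaccard(a, b):
    # Jaccard coefficient of two vectors viewed as sets:
    # |A intersect B| / |A union B|.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bucket_of(vector, reference_points):
    # Locality-sensitive hash: return the index of the reference point
    # P that the vector is most similar to. Similar vectors therefore
    # "collide" into the same bucket, which is the desired behavior here.
    return max(range(len(reference_points)),
               key=lambda i: jaccard(vector, reference_points[i]))
```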
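Finally, a rough in-process simulation of how the two phases could fit together, reusing knn_classify and bucket_of from the sketches above. The paper's actual implementation runs on Hadoop's MapReduce (typically as mapper and reducer classes on a cluster); the record layout and data flow here are simplified assumptions.

```python
from collections import defaultdict

def map_phase(train_records, reference_points):
    # Map: emit (bucket_number, record) for every training record,
    # mirroring the key/value pairs a Hadoop mapper would produce.
    buckets = defaultdict(list)
    for features, label in train_records:
        buckets[bucket_of(features, reference_points)].append((features, label))
    return buckets

def reduce_phase(buckets, test_point, reference_points, k):
    # Reduce: classify the test point using only the training records
    # that hashed to its bucket, instead of the whole training set.
    candidates = buckets[bucket_of(test_point, reference_points)]
    train = [features for features, _ in candidates]
    labels = [label for _, label in candidates]
    return knn_classify(train, labels, test_point, k)

# Tiny illustrative run (all data values are made up):
train_records = [((1, 2, 3), "A"), ((1, 2, 4), "A"), ((7, 8, 9), "B")]
refs = [(1, 2, 3), (7, 8, 9)]  # arbitrarily defined reference points P
buckets = map_phase(train_records, refs)
print(reduce_phase(buckets, (1, 2, 5), refs, k=2))  # -> "A"
```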