{"title":"在mapreduce框架中使用位置敏感散列改进KNN算法","authors":"S. Bagui, A. Mondal, S. Bagui","doi":"10.1145/3190645.3190700","DOIUrl":null,"url":null,"abstract":"The K-Nearest Neighbor! (KNN) algorithm is one of the most widely used algorithms in data mining for classification and prediction. The algorithm has several applications: in facial detection when used with deep learning, in biometric security applications etc. The traditional KNN algorithm involves an iterative process of computing the distance between a test data point and every data point in the training dataset, and classifying the object based on the closest training sample. This method first selects K nearest training data points for classifying a test data point and then predicts the test sample's class based on the majority class among those neighbors. If both the train and test datasets are large, this conventional form can be considered computationally expensive. Reduction of the massive calculation that is required to predict a data vector was our main goal, and with this intention, the training dataset was split into several buckets. The KNN algorithm was then performed inside a bucket, instead of iterating over the whole training dataset. We used the Jaccard Coefficient to determine the degree of similarity of a data vector with some arbitrarily defined data points P and placed similar data points in the same bucket. This was the core functionality of our hash function. The hash function determines the bucket number where the similar data vectors will be placed. Unlike the standard hashing algorithm, our approach of hashing was to maximize the probability of the hash collision to preserve the locality sensitiveness. Both the conventional and proposed methods were implemented in Hadoop's MapReduce framework. Hadoop gives us an architecture for handling large datasets on a computer cluster in a distributed manner and gives us massive scalability. The use of the locality sensitive hashing in KNN in Hadoop's MapReduce environment took less time than conventional KNN to classify a new data object.","PeriodicalId":403177,"journal":{"name":"Proceedings of the ACMSE 2018 Conference","volume":"29 11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Using locality sensitive hashing to improve the KNN algorithm in the mapreduce framework\",\"authors\":\"S. Bagui, A. Mondal, S. Bagui\",\"doi\":\"10.1145/3190645.3190700\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The K-Nearest Neighbor! (KNN) algorithm is one of the most widely used algorithms in data mining for classification and prediction. The algorithm has several applications: in facial detection when used with deep learning, in biometric security applications etc. The traditional KNN algorithm involves an iterative process of computing the distance between a test data point and every data point in the training dataset, and classifying the object based on the closest training sample. This method first selects K nearest training data points for classifying a test data point and then predicts the test sample's class based on the majority class among those neighbors. If both the train and test datasets are large, this conventional form can be considered computationally expensive. 
Reduction of the massive calculation that is required to predict a data vector was our main goal, and with this intention, the training dataset was split into several buckets. The KNN algorithm was then performed inside a bucket, instead of iterating over the whole training dataset. We used the Jaccard Coefficient to determine the degree of similarity of a data vector with some arbitrarily defined data points P and placed similar data points in the same bucket. This was the core functionality of our hash function. The hash function determines the bucket number where the similar data vectors will be placed. Unlike the standard hashing algorithm, our approach of hashing was to maximize the probability of the hash collision to preserve the locality sensitiveness. Both the conventional and proposed methods were implemented in Hadoop's MapReduce framework. Hadoop gives us an architecture for handling large datasets on a computer cluster in a distributed manner and gives us massive scalability. The use of the locality sensitive hashing in KNN in Hadoop's MapReduce environment took less time than conventional KNN to classify a new data object.\",\"PeriodicalId\":403177,\"journal\":{\"name\":\"Proceedings of the ACMSE 2018 Conference\",\"volume\":\"29 11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the ACMSE 2018 Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3190645.3190700\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACMSE 2018 Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3190645.3190700","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Using locality sensitive hashing to improve the KNN algorithm in the mapreduce framework
The K-Nearest Neighbor (KNN) algorithm is one of the most widely used algorithms in data mining for classification and prediction. It has several applications: facial detection when combined with deep learning, biometric security, etc. The traditional KNN algorithm iterates over the entire training dataset, computing the distance between the test data point and every training data point; it then selects the K nearest training data points and predicts the test sample's class as the majority class among those neighbors. When both the training and test datasets are large, this conventional approach is computationally expensive. Our main goal was to reduce the massive computation required to classify a data vector, and with this intention the training dataset was split into several buckets. The KNN algorithm was then performed inside a single bucket, instead of iterating over the whole training dataset. We used the Jaccard coefficient to measure the similarity of a data vector to a set of arbitrarily defined reference points P, and placed similar data vectors in the same bucket. This is the core functionality of our hash function: it determines the bucket number in which similar data vectors are placed. Unlike a standard hashing algorithm, our approach maximizes the probability of hash collisions in order to preserve locality sensitivity. Both the conventional and the proposed methods were implemented in Hadoop's MapReduce framework. Hadoop provides an architecture for handling large datasets on a computer cluster in a distributed manner and offers massive scalability. Using locality sensitive hashing with KNN in Hadoop's MapReduce environment took less time than conventional KNN to classify a new data object.
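
To make the baseline concrete, below is a minimal sketch of the conventional KNN classification described above. The function names and the choice of Euclidean distance are illustrative assumptions; the abstract does not fix a particular distance metric for the KNN step.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two numeric feature vectors
    # (an assumed metric; the paper does not specify one).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, labels, test_point, k):
    # Conventional KNN: compute the distance from the test point to
    # every training point, keep the k closest, and return the
    # majority class among those neighbors.
    order = sorted(range(len(train)),
                   key=lambda i: euclidean(train[i], test_point))
    nearest_labels = [labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```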
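The bucketing step can be sketched as follows. Treating each data vector as a set of values for the Jaccard computation, and hashing to the most similar reference point, are simplifying assumptions made for illustration; the key idea from the paper is that the hash deliberately maximizes collisions among similar vectors.

```python
def jaccard(a, b):
    # Jaccard coefficient of two vectors viewed as sets:
    # |A intersect B| / |A union B|.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bucket_of(vector, reference_points):
    # Locality-sensitive hash: return the index of the reference point
    # P that the vector is most similar to. Similar vectors therefore
    # "collide" into the same bucket, which is the desired behavior here.
    return max(range(len(reference_points)),
               key=lambda i: jaccard(vector, reference_points[i]))
```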
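Finally, a rough in-process simulation of how the two phases could fit together, reusing knn_classify and bucket_of from the sketches above. The paper's actual implementation runs on Hadoop's MapReduce (typically as mapper and reducer classes on a cluster); the record layout and data flow here are simplified assumptions.

```python
from collections import defaultdict

def map_phase(train_records, reference_points):
    # Map: emit (bucket_number, record) for every training record,
    # mirroring the key/value pairs a Hadoop mapper would produce.
    buckets = defaultdict(list)
    for features, label in train_records:
        buckets[bucket_of(features, reference_points)].append((features, label))
    return buckets

def reduce_phase(buckets, test_point, reference_points, k):
    # Reduce: classify the test point using only the training records
    # that hashed to its bucket, instead of the whole training set.
    candidates = buckets[bucket_of(test_point, reference_points)]
    train = [features for features, _ in candidates]
    labels = [label for _, label in candidates]
    return knn_classify(train, labels, test_point, k)

# Tiny illustrative run (all data values are made up):
train_records = [((1, 2, 3), "A"), ((1, 2, 4), "A"), ((7, 8, 9), "B")]
refs = [(1, 2, 3), (7, 8, 9)]  # arbitrarily defined reference points P
buckets = map_phase(train_records, refs)
print(reduce_phase(buckets, (1, 2, 5), refs, k=2))  # -> "A"
```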