Exact fuzzy k-nearest neighbor classification for big datasets

2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Pub Date: 2017-07-10, DOI: 10.1109/FUZZ-IEEE.2017.8015686

Jesús Maillo, J. Luengo, S. García, F. Herrera, I. Triguero


Citations: 20

Abstract



The k-Nearest Neighbors (kNN) classifier is one of the most effective methods in supervised learning. It classifies unseen cases by comparing their similarity with the training data. However, it gives every labeled sample the same importance when classifying. Several approaches enhance its precision, and the Fuzzy k-Nearest Neighbors (Fuzzy-kNN) classifier is among the most successful. Fuzzy-kNN computes a fuzzy degree of membership of each instance to the classes of the problem and, as a result, generates smoother borders between classes. Although a kNN approach for handling big datasets exists, there is no fuzzy variant able to manage that volume of data. Computing the class memberships adds extra computational cost, which makes Fuzzy-kNN even less scalable on large datasets because of its memory needs and high runtime. In this work, we present an exact, distributed approach based on Spark for running the Fuzzy-kNN classifier on big datasets, which provides the same precision as the original algorithm. It consists of two separate stages. The first stage transforms the training set by adding the class membership degrees. The second stage classifies the test set with the kNN algorithm, using the class memberships computed previously. In our experiments, we study the scaling-up capabilities of the proposed approach on datasets of up to 11 million instances, with promising results.
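
The two stages described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation: the abstract gives no formulas, so the code assumes the widely used Fuzzy-kNN formulation of Keller et al. (membership 0.51 + 0.49·n_j/k for a sample's own class and 0.49·n_j/k for the others, followed by inverse-distance weighting with fuzzifier m). The function names `membership_stage` and `classify_stage` and the parameters `k` and `m` are hypothetical, chosen only for this illustration.

```python
import math
from collections import Counter


def _nearest(point, samples, k, skip=None):
    """Return the k closest samples as (distance, index) pairs, ignoring index `skip`."""
    dists = [(math.dist(point, s), i) for i, s in enumerate(samples) if i != skip]
    dists.sort()
    return dists[:k]


def membership_stage(X, y, classes, k):
    """Stage 1: turn each crisp training label into class membership degrees.

    Assumed formulation (Keller et al.), not taken from the abstract.
    """
    memberships = []
    for i, (x, label) in enumerate(zip(X, y)):
        # Count the classes of the k nearest *other* training samples.
        counts = Counter(y[j] for _, j in _nearest(x, X, k, skip=i))
        row = []
        for c in classes:
            share = 0.49 * counts[c] / k
            row.append(0.51 + share if c == label else share)
        memberships.append(row)
    return memberships


def classify_stage(x, X, memberships, classes, k, m=2.0):
    """Stage 2: classify x from the memberships of its k nearest training samples."""
    neighbors = _nearest(x, X, k)
    # Inverse-distance weights; the small floor avoids division by zero.
    weights = [1.0 / max(d, 1e-12) ** (2.0 / (m - 1.0)) for d, _ in neighbors]
    total = sum(weights)
    scores = [
        sum(w * memberships[j][ci] for w, (_, j) in zip(weights, neighbors)) / total
        for ci in range(len(classes))
    ]
    best = max(range(len(classes)), key=scores.__getitem__)
    return classes[best], scores


if __name__ == "__main__":
    # Toy data, invented for the example: two well-separated clusters.
    X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    y = ["a", "a", "a", "b", "b", "b"]
    classes = ["a", "b"]
    u = membership_stage(X, y, classes, k=2)
    print(classify_stage((0.2, 0.2), X, u, classes, k=2))
```

Both stages reduce to nearest-neighbor searches over the training set, which is the expensive part on large data. This sketch uses a brute-force search and makes no attempt to reproduce the distributed design.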
