Combining nearest neighbour classifiers based on small subsamples for big data analytics

B. Krawczyk, Michal Wozniak
DOI: 10.1109/CYBConf.2015.7175952
Published in: 2015 IEEE 2nd International Conference on Cybernetics (CYBCONF)
Publication date: 2015-06-24
Citations: 0

Abstract

Contemporary machine learning systems must be able to deal with ever-growing volumes of data. However, most canonical classifiers are not well suited for big data analytics. This is especially evident for distance-based classifiers, whose classification time is prohibitive. Recently, many methods for adapting the nearest neighbour classifier to big data have been proposed. We investigate a simple yet efficient technique based on random under-sampling of the dataset. Since we deal with stationary data, one may assume that a subset of objects will sufficiently capture the properties of the given dataset. We propose to build distance-based classifiers on very small subsamples and then combine them into an ensemble. With this, one does not need to aggregate datasets, only the local decisions of the classifiers. Experimental results show that such an approach can return results comparable to the nearest neighbour classifier over the entire dataset, but with a significantly reduced classification time. We investigate the number of subsamples (ensemble members) required to capture the properties of each dataset. Finally, we propose to apply our subsampling-based ensemble in a distributed environment, which allows for a further reduction of the computational complexity of the nearest neighbour rule for big data.
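The scheme the abstract describes — draw several small random subsamples, fit a nearest-neighbour classifier on each, and combine only their local decisions by majority vote — can be sketched as below. This is an illustrative pure-Python sketch, not the authors' code; the member count, subsample size, 1-NN rule, and toy data are all assumptions made for the example.

```python
# Sketch of a subsample-based nearest-neighbour ensemble (illustrative, not the paper's code):
# each member is a 1-NN classifier over a small random subsample; predictions
# are made by majority vote over the members' local decisions.
import random
from collections import Counter

def one_nn_predict(subsample, x):
    """Return the label of the subsample point closest to x (squared Euclidean distance)."""
    nearest = min(subsample, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

def subsample_ensemble(train, n_members=10, sample_size=20, seed=0):
    """Draw n_members small random subsamples (without replacement) from the training set."""
    rng = random.Random(seed)
    return [rng.sample(train, min(sample_size, len(train))) for _ in range(n_members)]

def predict(ensemble, x):
    """Combine the members' local 1-NN decisions by majority vote."""
    votes = Counter(one_nn_predict(member, x) for member in ensemble)
    return votes.most_common(1)[0][0]

# Toy 2-D data: class 0 clustered near the origin, class 1 near (5, 5).
rng = random.Random(42)
train = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(100)] + \
        [((rng.gauss(5, 1), rng.gauss(5, 1)), 1) for _ in range(100)]

# 15 members of only 10 points each, instead of one 200-point 1-NN model.
ensemble = subsample_ensemble(train, n_members=15, sample_size=10)
print(predict(ensemble, (0.2, -0.1)))  # query in the class-0 region
print(predict(ensemble, (4.8, 5.3)))   # query in the class-1 region
```

Note the classification-time saving the abstract claims: each query compares against 15 × 10 = 150 stored points at most (and each member could be queried on a separate node), whereas it would scale with the full dataset size for a single nearest-neighbour classifier.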