Combining nearest neighbour classifiers based on small subsamples for big data analytics

B. Krawczyk, Michal Wozniak
DOI: 10.1109/CYBConf.2015.7175952
Published in: 2015 IEEE 2nd International Conference on Cybernetics (CYBCONF)
Publication date: 2015-06-24
Citations: 0

Abstract

Contemporary machine learning systems must be able to deal with ever-growing volumes of data. However, most canonical classifiers are not well suited for big data analytics. This is especially evident for distance-based classifiers, whose classification time is prohibitive. Recently, many methods for adapting the nearest neighbour classifier to big data have been proposed. We investigate a simple yet efficient technique based on random under-sampling of the dataset. Since we deal with stationary data, one may assume that a subset of objects will sufficiently capture the properties of the given dataset. We propose to build distance-based classifiers on very small subsamples and then combine them into an ensemble. With this, one does not need to aggregate datasets, only the local decisions of the classifiers. Experimental results show that such an approach can return results comparable to the nearest neighbour classifier over the entire dataset, but with a significantly reduced classification time. We investigate the number of subsamples (ensemble members) required to capture the properties of each dataset. Finally, we propose to apply our subsampling-based ensemble in a distributed environment, which allows for a further reduction of the computational complexity of the nearest neighbour rule for big data.
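The scheme the abstract describes — draw several small random subsamples, fit a nearest-neighbour classifier on each, and combine only their local decisions by majority vote — can be sketched as below. This is an illustrative pure-Python sketch, not the authors' code; the member count, subsample size, 1-NN rule, and toy data are all assumptions made for the example.

```python
# Sketch of a subsample-based nearest-neighbour ensemble (illustrative, not the paper's code):
# each member is a 1-NN classifier over a small random subsample; predictions
# are made by majority vote over the members' local decisions.
import random
from collections import Counter

def one_nn_predict(subsample, x):
    """Return the label of the subsample point closest to x (squared Euclidean distance)."""
    nearest = min(subsample, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]

def subsample_ensemble(train, n_members=10, sample_size=20, seed=0):
    """Draw n_members small random subsamples (without replacement) from the training set."""
    rng = random.Random(seed)
    return [rng.sample(train, min(sample_size, len(train))) for _ in range(n_members)]

def predict(ensemble, x):
    """Combine the members' local 1-NN decisions by majority vote."""
    votes = Counter(one_nn_predict(member, x) for member in ensemble)
    return votes.most_common(1)[0][0]

# Toy 2-D data: class 0 clustered near the origin, class 1 near (5, 5).
rng = random.Random(42)
train = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(100)] + \
        [((rng.gauss(5, 1), rng.gauss(5, 1)), 1) for _ in range(100)]

# 15 members of only 10 points each, instead of one 200-point 1-NN model.
ensemble = subsample_ensemble(train, n_members=15, sample_size=10)
print(predict(ensemble, (0.2, -0.1)))  # query in the class-0 region
print(predict(ensemble, (4.8, 5.3)))   # query in the class-1 region
```

Note the classification-time saving the abstract claims: each query compares against 15 × 10 = 150 stored points at most (and each member could be queried on a separate node), whereas it would scale with the full dataset size for a single nearest-neighbour classifier.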