Selecting samples for labeling in unbalanced streaming data environments

Hanqing Hu, M. Kantardzic, Tegjyot Singh Sethi
Venue: 2013 XXIV International Conference on Information, Communication and Automation Technologies (ICAT)
Publication date: 2013-10-01
DOI: 10.1109/ICAT.2013.6684046 (https://doi.org/10.1109/ICAT.2013.6684046)
Citations: 4

Abstract

In this paper we propose an alternative to random selection for labeling extremely unbalanced streaming data sets in which one class makes up only 1-10% of the data. Labeling, especially when it requires human effort, is often time-consuming and expensive. In an extremely unbalanced data set, many data points usually have to be labeled to obtain enough minority class samples. The goal of this research is to reduce the total number of samples that must be labeled when training new classification models to update a streaming data ensemble classifier. Our approach finds minority class clusters using the grid density algorithm and samples instances inside those regions. Results on a synthetic data set show that the efficiency of the proposed approach varies with grid size. Results on real-world data sets confirm that observation and show that, when the data set has high dimensionality, dimensionality reduction helps by reducing the number of grid cells in the data space, which increases sampling efficiency. Our best results show a 19.4% improvement for an eight-dimensional data set without dimensionality reduction and a 27.4% improvement for a thirty-six-dimensional data set with dimensionality reduction.
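
The abstract does not spell out the grid density algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a uniform grid with a fixed cell width, a small set of already labeled points, and a simple count threshold for calling a cell "minority dense". All names (`cell_of`, `minority_dense_cells`, `select_for_labeling`, `total_cells`, `cell_size`, `min_count`, `budget`) are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a point to the integer index tuple of the grid cell containing it."""
    return tuple(int(x // cell_size) for x in point)


def minority_dense_cells(labeled_points, labels, cell_size, min_count=2, minority=1):
    """Return the cells that hold at least `min_count` known minority samples.

    Assumption: density is judged by a plain count threshold; the paper's
    actual density criterion is not given in the abstract.
    """
    counts = defaultdict(int)
    for point, label in zip(labeled_points, labels):
        if label == minority:
            counts[cell_of(point, cell_size)] += 1
    return {cell for cell, n in counts.items() if n >= min_count}


def select_for_labeling(unlabeled_points, dense_cells, cell_size, budget, rng=None):
    """Pick up to `budget` unlabeled points that fall inside minority-dense cells."""
    rng = rng or random.Random(0)
    candidates = [p for p in unlabeled_points if cell_of(p, cell_size) in dense_cells]
    rng.shuffle(candidates)
    return candidates[:budget]


def total_cells(bins_per_dim, n_dims):
    """Number of cells in a uniform grid: bins_per_dim ** n_dims."""
    return bins_per_dim ** n_dims


# Toy 2-D example: label 1 is the minority class.
labeled = [(0.1, 0.2), (0.3, 0.4), (2.5, 2.5), (3.1, 0.2), (0.4, 0.1)]
labels = [1, 1, 0, 0, 1]
dense = minority_dense_cells(labeled, labels, cell_size=1.0)

# New stream points: only those inside a minority-dense cell are sent for labeling.
stream = [(0.5, 0.5), (2.2, 2.9), (0.9, 0.1), (3.5, 3.5)]
picked = select_for_labeling(stream, dense, cell_size=1.0, budget=2)
print(dense, picked)

# Grid size grows exponentially with dimensionality.
print(total_cells(10, 8), total_cells(10, 36))
```

The `total_cells` helper illustrates why dimensionality reduction matters for a grid-based method: with the same number of bins per dimension, a thirty-six-dimensional grid has vastly more cells than an eight-dimensional one, so most cells are empty and density estimates become unreliable unless the dimensionality is reduced first.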