Subspace selection in high-dimensional big data using genetic algorithm in Apache Spark

Fatemeh Cheraghchi, Arash Iranzad, B. Raahemi
Published in: Proceedings of the Second International Conference on Internet of things, Data and Cloud Computing
Publication date: 2017-03-22
DOI: 10.1145/3018896.3018950 (https://doi.org/10.1145/3018896.3018950)
Citations: 2

Abstract

In high-dimensional space with large amounts of data, distances between data points tend to become relatively uniform. The notion of the nearest neighbours of a data point thus becomes meaningless, a phenomenon known as the "curse of dimensionality." Identifying outliers (data points with statistical characteristics significantly different from those of the majority of the data) in such a high-dimensional space can be a significant challenge. Mining for outliers in subspaces with relevant attributes is one of the approaches to this problem, and identifying these attributes is the main objective of this work. In this paper, we scale a grid-based solution to search for subspaces that are candidates for outlier detection with regard to the subset of features in the subspace. We specify a population and a fitness function for a distributed genetic algorithm that heuristically searches the subspaces within the high-dimensional data and finds the subspace with maximal sparsity. We designed and implemented our proposed subspace selection algorithm in Apache Spark, a fast in-memory engine for large-scale data processing. The initial experimental results on a large dataset (77,000 records and 1,379 attributes) confirm that our proposed method can identify the most relevant subspaces for outlier detection.
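
The abstract names three ingredients: a grid over the data, a sparsity-based fitness function, and a genetic search over subspaces. The following is a minimal single-machine sketch of how these could fit together; it is not the authors' implementation and does not use Apache Spark. Everything specific in it is an assumption made for illustration: the equi-depth grid with `phi` ranges per attribute, the sparsity coefficient (observed minus expected cell count, divided by the standard deviation under independence), the choice to score a subspace by its sparsest occupied cell, and the function names `discretize`, `sparsity`, and `genetic_search` with their parameters.

```python
import math
import random
from collections import Counter

def discretize(data, phi):
    """Assumed grid: map each attribute to phi equi-depth ranges (0..phi-1) by rank."""
    n, d = len(data), len(data[0])
    cells = [[0] * d for _ in range(n)]
    for j in range(d):
        order = sorted(range(n), key=lambda i: data[i][j])
        for rank, i in enumerate(order):
            cells[i][j] = rank * phi // n
    return cells

def sparsity(cells, dims, phi):
    """Assumed fitness: lowest sparsity coefficient among occupied grid cells
    of the subspace `dims`. More negative means sparser. Requires phi >= 2."""
    n, k = len(cells), len(dims)
    f = (1.0 / phi) ** k                  # fraction of points expected per cell
    expected = n * f
    std = math.sqrt(n * f * (1 - f))
    counts = Counter(tuple(row[j] for j in dims) for row in cells)
    return min((c - expected) / std for c in counts.values())

def genetic_search(cells, k, phi, pop_size=20, generations=30, seed=0):
    """Search k-attribute subspaces; an individual is a sorted tuple of attribute indices."""
    rng = random.Random(seed)
    d = len(cells[0])

    def fitness(ind):
        return sparsity(cells, ind, phi)

    pop = [tuple(sorted(rng.sample(range(d), k))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]    # keep the sparsest half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw k attributes from the union of both parents.
            child = set(rng.sample(sorted(set(a) | set(b)), k))
            # Mutation: swap one attribute for one not currently in the child.
            if rng.random() < 0.2:
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice([j for j in range(d) if j not in child]))
            children.append(tuple(sorted(child)))
        pop = parents + children
    return min(pop, key=fitness)

# Synthetic data for illustration only: 200 records, 6 attributes.
rng = random.Random(1)
data = [[rng.random() for _ in range(6)] for _ in range(200)]
cells = discretize(data, phi=3)
best = genetic_search(cells, k=2, phi=3)
print(best, sparsity(cells, best, phi=3))
```

In this sketch the cost of evaluating an individual is dominated by the per-cell counting inside `sparsity`. In a distributed setting that counting step is the natural candidate for spreading across workers, though how the paper partitions the work in Spark is not stated in the abstract.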