Subspace selection in high-dimensional big data using genetic algorithm in Apache Spark

Proceedings of the Second International Conference on Internet of Things, Data and Cloud Computing. Pub Date: 2017-03-22. DOI: 10.1145/3018896.3018950

Fatemeh Cheraghchi, Arash Iranzad, B. Raahemi

Citations: 2

Abstract

In high-dimensional space with large amounts of data, distances between data points tend to become relatively uniform. The notion of the nearest neighbours of a data point thus becomes meaningless, a phenomenon known as the "curse of dimensionality." Identifying outliers (data points whose statistical characteristics differ significantly from those of the majority of the data) in such a high-dimensional space can be a significant challenge. Mining for outliers in subspaces of relevant attributes is one approach to this problem, and identifying these attributes is the main objective of this work. In this paper, we scale a grid-based solution to search for subspaces that are candidates for outlier detection with regard to the subset of features in the subspace. We specify a population and a fitness function for a distributed genetic algorithm that heuristically searches the subspaces within the high-dimensional data and finds the subspace with maximal sparsity. We designed and implemented the proposed subspace selection algorithm in Apache Spark, a fast in-memory engine for large-scale data processing. Initial experimental results on a large dataset (77,000 records and 1,379 attributes) confirm that the proposed method can identify the most relevant subspaces for outlier detection.
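The abstract does not spell out the grid construction, the fitness function, or the genetic operators, so the following is only a rough single-machine sketch of the general idea, not the authors' implementation. It assumes equi-depth discretization of each attribute into `phi` ranges and uses, as an assumed fitness, the sparsity coefficient common in grid-based subspace outlier detection: S = (n - N f^k) / sqrt(N f^k (1 - f^k)), with f = 1/phi, N records, k attributes in the subspace, and n records in a grid cell. The names `discretize`, `sparsity`, and `evolve`, the crossover and mutation rules, and all parameter defaults are hypothetical.

```python
import math
import random
from collections import Counter

def discretize(column, phi):
    """Assign each value to one of phi equi-depth ranges (0..phi-1) by rank."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    bins = [0] * len(column)
    for rank, i in enumerate(order):
        bins[i] = rank * phi // len(column)
    return bins

def sparsity(grid, dims, phi):
    """Most negative sparsity coefficient over the occupied cells of `dims`.

    grid[i][d] is the range index of record i in attribute d; phi must be >= 2.
    Lower (more negative) values mean a sparser cell than expected.
    """
    n_records, k = len(grid), len(dims)
    f_k = (1.0 / phi) ** k
    expected = n_records * f_k
    std = math.sqrt(expected * (1.0 - f_k))
    # Counting records per cell is the step one would distribute in Spark.
    counts = Counter(tuple(row[d] for d in dims) for row in grid)
    return min((c - expected) / std for c in counts.values())

def evolve(grid, n_attrs, k, phi, pop_size=20, generations=30, seed=0):
    """Search for a k-attribute subspace with the lowest sparsity coefficient."""
    rng = random.Random(seed)
    pop = [tuple(sorted(rng.sample(range(n_attrs), k))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda dims: sparsity(grid, dims, phi))
        parents = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))    # crossover: k attributes from both parents
            if rng.random() < 0.2:              # mutation: swap one attribute
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice([d for d in range(n_attrs) if d not in child]))
            children.append(tuple(sorted(child)))
        pop = parents + children
    return min(pop, key=lambda dims: sparsity(grid, dims, phi))
```

To use it, discretize every attribute column, transpose the result into per-record rows, and call `evolve(grid, n_attrs, k, phi)`; the returned tuple holds the attribute indices of the selected subspace. Scoring only occupied cells is a design choice of this sketch, on the reasoning that an empty cell contains no records to flag as outliers.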
