Distributed Entropy Minimization Discretizer for Big Data Analysis under Apache Spark

2015 IEEE Trustcom/BigDataSE/ISPA. Pub Date: 2015-08-20. DOI: 10.1109/Trustcom.2015.559

S. Ramírez-Gallego, S. García, Héctor Mouriño-Talín, David Martínez-Rego

Citations: 20

Abstract

The astonishing rate at which data is now generated on the Internet has rendered many classical knowledge extraction techniques obsolete. Data reduction techniques are required to lower the computational complexity of these methods. Among reduction techniques, discretization is one of the most important tasks in the data mining process; it aims to simplify and reduce continuous-valued data in large datasets. Despite the great interest in this reduction mechanism, only a few simple discretization techniques have been implemented for Big Data in the literature. We therefore propose a distributed implementation of the entropy minimization discretizer proposed by Fayyad and Irani, built on the Apache Spark platform. Our solution goes beyond a simple parallelization: it transforms the iterative procedure of the original proposal into a single-step computation. Experimental results on two large-scale datasets show that our solution is able to improve classification accuracy as well as boost the underlying learning process.
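For readers unfamiliar with the entropy minimization discretizer of Fayyad and Irani, the following is a minimal sequential sketch of its core step for a single attribute: choose the boundary that maximizes information gain, then accept it only if the gain passes the minimum description length (MDL) test. It is not the distributed Spark implementation described in the abstract, and the function names `entropy` and `best_cut` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Find the single best cut point for one continuous attribute.

    Returns (cut_point, gain, accepted), where `accepted` states whether
    the cut passes the MDL stopping criterion, or None if every value is
    identical and no cut is possible. Illustrative sketch only.
    """
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    base = entropy(ys)

    best = None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # a cut is only possible between distinct values
        left, right = ys[:i], ys[i:]
        e_left, e_right = entropy(left), entropy(right)
        gain = base - (i / n) * e_left - ((n - i) / n) * e_right
        if best is None or gain > best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, gain, left, right, e_left, e_right)

    if best is None:
        return None

    cut, gain, left, right, e_left, e_right = best
    # MDL criterion: accept the cut only if
    # gain > (log2(n - 1) + delta) / n, where k, k1, k2 are the numbers
    # of distinct classes in the whole set and in each side of the cut.
    k, k1, k2 = len(set(ys)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * base - k1 * e_left - k2 * e_right)
    threshold = (math.log2(n - 1) + delta) / n
    return cut, gain, gain > threshold
```

In the original algorithm this step is applied recursively to each resulting interval until the MDL test rejects further cuts. That recursion is the iterative behavior the abstract says the distributed version replaces with a single-step computation.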

