EM*: An EM Algorithm for Big Data

2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). Pub Date: 2016-10-01. DOI: 10.1109/DSAA.2016.40

H. Kurban, Mark Jenne, Mehmet M. Dalkilic

Citations: 3

Abstract

Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms such as the Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T EM*. We show that our EM* algorithm outperforms the EM-T algorithm on large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
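The abstract does not specify how the heap is keyed or when a point counts as settled, so the following is only a minimal sketch of the general idea, not the authors' EM* procedure. It assumes a one-dimensional mixture of equal-weight Gaussians with known means, keys a min-heap by each point's largest cluster responsibility, and treats points below a hypothetical `threshold` as the ones worth revisiting. The names `responsibilities`, `split_by_confidence`, and `threshold` are illustrative assumptions.

```python
import heapq
import math


def responsibilities(x, means, sigma=1.0):
    """Posterior memberships of x under equal-weight 1-D Gaussians (assumed model)."""
    exps = [-0.5 * ((x - m) / sigma) ** 2 for m in means]
    top = max(exps)  # shift exponents so the largest density is exactly 1.0
    dens = [math.exp(e - top) for e in exps]
    total = sum(dens)
    return [d / total for d in dens]


def split_by_confidence(points, means, threshold=0.9):
    """Return (revisit, settled) index lists; revisit is ordered hardest first."""
    # Min-heap keyed by the largest responsibility: the point whose
    # cluster membership is least certain sits at the root.
    heap = [(max(responsibilities(x, means)), i) for i, x in enumerate(points)]
    heapq.heapify(heap)

    revisit = []
    # Pop only the uncertain points; everything left in the heap is
    # confidently assigned and could be skipped in the next iteration.
    while heap and heap[0][0] < threshold:
        revisit.append(heapq.heappop(heap)[1])
    settled = sorted(i for _, i in heap)
    return revisit, settled


points = [0.1, 4.9, 2.4, -0.3, 5.2, 2.7]
revisit, settled = split_by_confidence(points, means=[0.0, 5.0])
print(revisit, settled)
```

Here the two points lying between the cluster centers (indices 2 and 5) are the only ones selected for revisiting, which mirrors the abstract's goal of narrowing iteration toward data that is harder to cluster. A full EM loop would re-estimate the means from the data and rebuild or update the heap each round; that part is omitted because the abstract gives no details on it.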



