Using Mahout for Clustering Wikipedia's Latest Articles: A Comparison between K-means and Fuzzy C-means in the Cloud

2011 IEEE Third International Conference on Cloud Computing Technology and Science. Pub Date: 2011-11-29. DOI: 10.1109/CLOUDCOM.2011.86

R. Esteves, Chunming Rong

Citations: 73

Abstract

This paper compares k-means and fuzzy c-means for clustering a large, noisy, realistic dataset. We made the comparison using a free cloud computing solution, Apache Mahout/Hadoop, and Wikipedia's latest articles. In the past, the use of these two algorithms was restricted to small datasets. As a result, studies were based on artificial datasets that do not represent a real document clustering situation. In this ongoing research, we found that on a noisy dataset, fuzzy c-means can lead to worse cluster quality than k-means. The convergence speed of k-means is not always faster. We also found that Mahout is a promising clustering technology, but its preprocessing tools are not developed enough for efficient dimensionality reduction. In our experience, the use of Apache Mahout is premature.
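For readers unfamiliar with the two algorithms, the following is a minimal sketch of how their assignment steps differ: k-means gives each point to exactly one cluster, while fuzzy c-means gives each point a degree of membership in every cluster. This is not the Mahout implementation and is not taken from the paper; the function names, the two example centers, the sample point, and the fuzziness parameter `m = 2` are all illustrative assumptions.

```python
import math


def distances(point, centers):
    """Euclidean distance from one point to every cluster center."""
    return [math.dist(point, c) for c in centers]


def kmeans_assign(point, centers):
    """Hard assignment (k-means): index of the nearest center."""
    d = distances(point, centers)
    return d.index(min(d))


def fcm_memberships(point, centers, m=2.0):
    """Soft assignment (fuzzy c-means): memberships that sum to 1.

    Uses the standard update u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)),
    where m > 1 is the fuzziness parameter (assumed to be 2 here).
    """
    d = distances(point, centers)
    # A point lying exactly on one or more centers belongs fully to them.
    if any(x == 0.0 for x in d):
        hits = [1.0 if x == 0.0 else 0.0 for x in d]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    n = len(d)
    return [1.0 / sum((d[i] / d[k]) ** exponent for k in range(n))
            for i in range(n)]


# Hypothetical toy data: two centers and one point closer to the first.
centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)

hard = kmeans_assign(point, centers)    # a single cluster index
soft = fcm_memberships(point, centers)  # one weight per cluster

print("k-means cluster:", hard)
print("fuzzy c-means memberships:", soft)
```

In the soft case every point contributes to every center update in proportion to its membership, whereas in the hard case it contributes only to its nearest center.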
