An Efficient Hybrid-Clustream Algorithm for Stream Mining

2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS) Pub Date : 2017-12-01 DOI:10.1109/SITIS.2017.77

Ashish Kumar, Ajmer Singh, Rajvir Singh


Citations: 3

Abstract

Stream clustering is one of the most important fields in machine learning. Traditional unsupervised clustering has normally been carried out in batch mode, where the data fit in memory and several passes over them are allowed. The Big Data paradigm, however, has created an environment in which data can be potentially unbounded and arrive continuously. Such streams can reach computing systems at high speed, and the processes that generate them may be non-stationary. For clustering, this means that not all of the data can be stored in memory and that the number and size of clusters are unknown. Noise levels can also be high, owing to either data generation or transmission. Together, these factors make traditional clustering methods unsuitable. As a consequence, stream clustering has emerged as a field of intense research aimed at tackling these challenges. Clustream is one of the most advanced state-of-the-art stream clustering algorithms. It works in two phases: an online micro-clustering phase, in which statistics describing the incoming data are gathered, and an offline macro-clustering phase, in which a conventional non-stream clustering algorithm is run on the high-level statistics produced by the online step. Because of this design, and in particular its offline step, it requires expert-level parametrization, or suffers from low runtime performance, or is highly sensitive to noise, or degrades considerably in high-dimensional spaces. We propose a new stream clustering algorithm, Clustream-hybrid, based on Clustream's clustering principles. It follows the same process as Clustream but uses k-means++ instead of k-means in the macro-clustering phase, which enables fast runtime computation while preserving accuracy in high-dimensional settings. We integrated it into the MOA (Massive Online Analysis) tool. We evaluated the results with nine clustering quality metrics and compared its performance with that of Clustream on both synthetic and real data sets. The results are encouraging: Clustream-hybrid outperforms Clustream on the quality metrics in most cases.
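The offline step described in the abstract, k-means++ seeding applied to the summaries produced by the online phase, can be illustrated with a minimal sketch. It is not the authors' MOA implementation. It assumes each micro-cluster is reduced to a point count `n` and a per-dimension linear sum, that the macro step clusters the resulting centroids, and that each centroid is weighted by `n`. The names `macro_cluster` and `kmeanspp_seeds` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, weights, k, rng):
    """Choose k initial centers by weighted k-means++ (D^2) sampling."""
    # First center: drawn with probability proportional to its weight.
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        # Score = weight * squared distance to the nearest chosen center.
        scores = [w * min(sq_dist(p, c) for c in centers)
                  for p, w in zip(points, weights)]
        if sum(scores) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=scores, k=1)[0])
    return centers


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def macro_cluster(micro_clusters, k, iters=20, seed=0):
    """Weighted k-means over micro-cluster centroids, seeded with k-means++.

    micro_clusters: list of (n, linear_sum) pairs (assumed summary format).
    Returns (centers, labels), one label per micro-cluster.
    """
    rng = random.Random(seed)
    # Centroid of a micro-cluster = linear sum divided by point count.
    points = [tuple(s / n for s in ls) for n, ls in micro_clusters]
    weights = [n for n, _ in micro_clusters]
    centers = kmeanspp_seeds(points, weights, k, rng)
    dim = len(points[0])
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [(p, w) for p, w, lab in zip(points, weights, labels)
                       if lab == j]
            if not members:
                continue  # keep the old center if no micro-cluster chose it
            total = sum(w for _, w in members)
            centers[j] = tuple(sum(p[d] * w for p, w in members) / total
                               for d in range(dim))
    labels = [nearest(p, centers) for p in points]
    return centers, labels
```

Calling `macro_cluster(summaries, k=2)` on a list of `(n, linear_sum)` pairs returns the macro-cluster centers and one label per micro-cluster. Compared with plain k-means, only the seeding step differs: later centers are drawn preferentially from regions far from those already chosen, which tends to reduce the number of refinement iterations needed.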
