Anytime clustering of data streams while handling noise and concept drift

IF 1.7 · Zone 4, Computer Science · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Journal of Experimental & Theoretical Artificial Intelligence Pub Date : 2021-03-15 DOI:10.1080/0952813X.2021.1882001

Jagat Sesh Challa, Poonam Goyal, Ajinkya Kokandakar, D. Mantri, Pranet Verma, S. Balasubramaniam, Navneet Goyal

Volume: 62 1 · Pages: 399 - 429 · Open access: no

Citations: 2

Abstract

Clustering of data streams has become very popular in recent times, owing to the rapid rise of real-time streaming utilities that produce large amounts of data at varying inter-arrival rates. We propose AnyClus, a framework for anytime clustering of data streams. AnyClus uses a proposed variant of the R-tree, AnyRTree, to capture incoming stream objects arriving at a variable rate and to index them as micro-clusters in a hierarchical fashion. The leaf-level micro-clusters produced are aggregated and stored in a logarithmic tilted-time window framework (TTWF). Our extensive experimental analysis shows (i) the capability of AnyClus to handle variable stream speeds (up to 250k objects/second); (ii) its ability to produce micro-clusters of high purity (≈1) and compactness; and (iii) the effectiveness of AnyRTree in handling noise, capturing concept drift, and preserving spatial locality in the indexing of micro-clusters, when compared with existing methods. We also propose a parallel framework, Any-MP-Clus, for anytime clustering of multiport data streams over commodity clusters. Any-MP-Clus uses an AnyRTree at each computing node of the cluster (one per stream port) and maintains the aggregated micro-clusters in the TTWF. Experimental results on billion-scale datasets show that Any-MP-Clus is scalable and efficient, and produces clustering of higher quality.
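To make the two building blocks named in the abstract more concrete, the following is a minimal sketch of an additive micro-cluster summary and a logarithmic tilted-time window. It is not the authors' AnyRTree or TTWF implementation: the class names `MicroCluster` and `TiltedTimeWindow`, the (count, linear sum, squared sum) representation, the `capacity` parameter, and the rule of merging the two oldest summaries on overflow are all assumptions borrowed from common stream-clustering practice, and the sketch stores one summary per time tick rather than a full set of leaf-level micro-clusters.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class MicroCluster:
    """Hypothetical additive summary: point count, linear sum, squared sum."""
    n: int
    ls: List[float]
    ss: List[float]

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "MicroCluster":
        return cls(1, list(p), [x * x for x in p])

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        # Summaries are additive, so merging is a component-wise sum.
        return MicroCluster(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            [a + b for a, b in zip(self.ss, other.ss)],
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]


class TiltedTimeWindow:
    """Assumed logarithmic scheme: a summary at level i covers 2**i ticks."""

    def __init__(self, capacity: int = 2) -> None:
        self.capacity = capacity
        # levels[i] holds summaries for level i, oldest first.
        self.levels: List[List[MicroCluster]] = []

    def add(self, mc: MicroCluster) -> None:
        carry: Optional[MicroCluster] = mc
        level = 0
        while carry is not None:
            if level == len(self.levels):
                self.levels.append([])
            slot = self.levels[level]
            slot.append(carry)
            carry = None
            if len(slot) > self.capacity:
                # Overflow: merge the two oldest summaries and push the
                # coarser result one level up.
                oldest, second = slot.pop(0), slot.pop(0)
                carry = oldest.merge(second)
                level += 1
```

Under these assumptions, recent data is kept at fine granularity and older data at progressively coarser granularity, so the number of stored summaries grows only logarithmically with the length of the stream.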


Source journal

Journal of Experimental & Theoretical Artificial Intelligence (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore

6.10

Self-citation rate

4.50%

Articles published

Review time

>12 weeks

About the journal: Journal of Experimental & Theoretical Artificial Intelligence (JETAI) is a world-leading journal dedicated to publishing high-quality, rigorously reviewed, original papers in artificial intelligence (AI) research. The journal features work in all subfields of AI research and accepts both theoretical and applied research. Topics covered include, but are not limited to, the following: • cognitive science • games • learning • knowledge representation • memory and neural system modelling • perception • problem-solving