TEDA-driven adaptive stream clustering for concept drift detection

IF 2.7 · CAS Zone 3 (Computer Science) · JCR Q3: Computer Science, Artificial Intelligence

Data & Knowledge Engineering · Pub Date: 2025-07-22 · DOI: 10.1016/j.datak.2025.102484

Zahra Rezaei, Hedieh Sajedi

Data & Knowledge Engineering, Volume 160, Article 102484. Journal article. https://www.sciencedirect.com/science/article/pii/S0169023X25000795

Citations: 0

Abstract

The rapid growth of data-driven applications has underlined the need for robust methods to analyze and cluster streaming data. Data stream clustering aims to uncover useful knowledge hidden in data streams, which typically arrive fast and evolve in structure and pattern. However, most current methods face significant challenges, such as detecting arbitrarily shaped clusters, handling outliers, adapting to concept drift, and reducing dependence on predefined parameters. To tackle these challenges, we propose a novel stream clustering algorithm with concept drift detection based on Typicality and Eccentricity Data Analysis (TEDA), which divides the clustering problem into two subproblems: micro-clusters and macro-clusters. Our methodology uses a TEDA-based concept drift detection approach (TEDA-CDD) to enhance data stream clustering. The method employs two models to monitor the data stream, retaining the information of a previous concept while tracking the emergence of a new one. The two models are taken to represent distinct concepts when the intersection of their data samples, as measured by the Jaccard index, is significantly low. TEDA-CDD is compared with known methods from the literature in experiments on synthetic and real-world datasets that simulate real-world applications. By dynamically updating clusters through model reuse or creation, our algorithm adapts to real-time changes in data distributions. The proposed algorithm was comprehensively evaluated on the KDDCup-99 dataset, an intrusion detection benchmark, under diverse scenarios including concept drift, evolving data distributions, varying cluster sizes, and outlier conditions. Empirical results showed that the algorithm outperformed baseline approaches such as DenStream, DStream, ClusTree, and DGStream, achieving perfect performance metrics. These findings emphasize the effectiveness of our algorithm in addressing real-world streaming data challenges, combining high sensitivity to concept drift with computational efficiency, adaptability, and robust clustering capabilities.
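The abstract names two building blocks, TEDA eccentricity and the Jaccard index, without giving formulas. The sketch below illustrates them under stated assumptions and is not the authors' implementation: the recursive mean, variance, and eccentricity updates follow the standard TEDA formulation from the literature, while the names `TedaModel`, `jaccard`, and `concepts_differ`, the sensitivity `m=3`, and the 0.1 threshold are hypothetical choices made for illustration only.

```python
class TedaModel:
    """Recursive TEDA statistics for one concept (illustrative sketch only)."""

    def __init__(self, m=3.0):
        self.m = m        # assumed sensitivity for the Chebyshev-style outlier test
        self.k = 0        # number of samples absorbed so far
        self.mean = []    # running mean vector
        self.var = 0.0    # running scalar variance

    def update(self, x):
        """Absorb sample x and return its normalized eccentricity."""
        self.k += 1
        k = self.k
        if k == 1:
            self.mean = list(x)
            return 0.5  # a lone sample has eccentricity 1, so normalized 1/2
        # Standard recursive TEDA updates of mean and variance.
        self.mean = [((k - 1) * mu + xi) / k for mu, xi in zip(self.mean, x)]
        dist2 = sum((xi - mu) ** 2 for xi, mu in zip(x, self.mean))
        self.var = (k - 1) / k * self.var + dist2 / (k - 1)
        if self.var == 0.0:
            return 1.0 / (2 * k)  # all samples identical so far
        eccentricity = 1.0 / k + dist2 / (k * self.var)
        return eccentricity / 2.0

    def is_outlier(self, zeta):
        """Flag a sample whose normalized eccentricity exceeds (m^2 + 1) / (2k)."""
        return zeta > (self.m ** 2 + 1) / (2 * self.k)


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets of sample ids."""
    union = a | b
    # Returning 0.0 for two empty sets is a convention chosen for this sketch.
    return len(a & b) / len(union) if union else 0.0


def concepts_differ(typical_a, typical_b, threshold=0.1):
    """Assumed drift rule: the models disagree when their sample overlap is low."""
    return jaccard(typical_a, typical_b) < threshold
```

In use, each of the two models would record the ids of the samples it does not flag as outliers, and `concepts_differ` would be called on those two sets. How the paper actually collects the sets and chooses the threshold is not stated in the abstract.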


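Source journal

Data & Knowledge Engineering (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore

5.00

Self-citation rate

0.00%

Articles published

Review time

6 months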

About the journal: Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.