TEDA-driven adaptive stream clustering for concept drift detection
Authors: Zahra Rezaei, Hedieh Sajedi
Journal: Data & Knowledge Engineering, volume 160, Article 102484
Publication date: 2025-07-22
Publication type: Journal Article
DOI: 10.1016/j.datak.2025.102484
URL: https://www.sciencedirect.com/science/article/pii/S0169023X25000795
Impact factor: 2.7 (JCR Q3, Computer Science, Artificial Intelligence)
Open access: no
Citation count: 0
Abstract
The rapid growth of data-driven applications has underlined the need for robust methods to analyze and cluster streaming data. Data stream clustering aims to uncover useful knowledge hidden in data streams, which are typically fast and whose structure and patterns evolve over time. However, most current methods face significant challenges: they struggle to detect arbitrarily shaped clusters, handle outliers, adapt to concept drift, and reduce their dependency on predefined parameters. To tackle these challenges, we propose a novel stream clustering algorithm with concept drift detection based on Typicality and Eccentricity Data Analysis (TEDA), which divides the clustering problem into two subproblems: micro-clusters and macro-clusters. The method uses two models to monitor the data stream, retaining the information of a previous concept while tracking the emergence of a new one. The two models are taken to represent distinct concepts when the overlap between their data samples, as measured by the Jaccard Index, is significantly low. By dynamically updating clusters through model reuse or creation, the algorithm adapts to real-time changes in data distributions. The proposed method, TEDA-CDD, is compared with established methods from the literature in experiments on synthetic and real-world datasets that reflect practical applications. The algorithm was evaluated on the KDDCup-99 dataset, an intrusion detection benchmark, under diverse scenarios including concept drift, evolving data distributions, varying cluster sizes, and outlier conditions. Empirical results showed that the algorithm outperformed baseline approaches such as DenStream, DStream, ClusTree, and DGStream, achieving perfect scores on the reported performance metrics. These findings highlight the effectiveness of the algorithm in addressing real-world streaming data challenges, combining high sensitivity to concept drift with computational efficiency, adaptability, and robust clustering.
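The abstract does not give formulas or thresholds, so the following is only a minimal sketch of the two ingredients it names, not the authors' implementation. It assumes the recursive mean, variance, and eccentricity updates commonly given in the TEDA literature, together with the Chebyshev-style outlier test on normalized eccentricity, and it compares the sample sets of two models with the Jaccard Index. The function names, the parameter `m`, and the `threshold` value are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Set, Tuple


def teda_update(k: int, mean: List[float], var: float,
                x: Sequence[float]) -> Tuple[int, List[float], float, float]:
    """Absorb sample x; return updated (k, mean, var) and the eccentricity of x.

    Assumed recursive form (from the general TEDA literature, not this paper):
      mean_k = ((k-1)/k) * mean_{k-1} + x/k
      var_k  = ((k-1)/k) * var_{k-1} + ||x - mean_k||^2 / (k-1)
      ecc_k  = 1/k + ||mean_k - x||^2 / (k * var_k)
    """
    k += 1
    if k == 1:
        mean = list(x)
    else:
        mean = [((k - 1) * m_i + x_i) / k for m_i, x_i in zip(mean, x)]
    dist2 = sum((x_i - m_i) ** 2 for x_i, m_i in zip(x, mean))
    var = (k - 1) / k * var + dist2 / (k - 1) if k > 1 else 0.0
    # Convention adopted here: with zero variance the distance term is dropped.
    ecc = 1.0 / k + (dist2 / (k * var) if var > 0.0 else 0.0)
    return k, mean, var, ecc


def is_eccentric(ecc: float, k: int, m: float = 3.0) -> bool:
    """Chebyshev-style test: normalized eccentricity ecc/2 exceeds (m^2+1)/(2k)."""
    return ecc / 2.0 > (m * m + 1.0) / (2.0 * k)


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard Index |a & b| / |a | b|; defined as 0.0 here when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def concepts_are_distinct(a: Set[int], b: Set[int], threshold: float = 0.1) -> bool:
    """Treat two models as distinct concepts when their sample overlap is low.

    `threshold` is a hypothetical value; the abstract only says "significantly low".
    """
    return jaccard(a, b) < threshold


if __name__ == "__main__":
    # Feed a 1-D stream into one model and flag samples that look eccentric.
    k, mean, var = 0, [], 0.0
    for i, value in enumerate([0.0, 1.0] * 10 + [100.0]):
        k, mean, var, ecc = teda_update(k, mean, var, [value])
        if is_eccentric(ecc, k):
            print(f"sample {i} (value {value}) is eccentric")

    # Compare the sample ids accepted by two hypothetical models.
    old_model_ids, new_model_ids = {1, 2, 3}, {2, 3, 4}
    print(jaccard(old_model_ids, new_model_ids))  # 0.5
```

In this sketch each model would keep the set of sample identifiers it accepted as typical. A low Jaccard Index between the two sets suggests the stream has moved to a new concept, while a high value suggests the existing model can be reused.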
About the journal:
Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.