Online embedding and clustering of evolving data streams
Authors: Alaettin Zubaroğlu, V. Atalay
Journal: Statistical Analysis and Data Mining: The ASA Data Science Journal
Publication date: 2022-07-06
Publication type: Journal Article
DOI: 10.1002/sam.11590 (https://doi.org/10.1002/sam.11590)
Citations: 1
Abstract
Online embedding and clustering of evolving data streams
The number of connected devices is steadily increasing, and this trend is expected to continue in the near future. Connected devices continuously generate data streams, which are often high dimensional and may contain concept drift. Clustering is one of the most suitable methods for real-time data stream processing, since it can be applied with less prior information about the data. Data embedding, in turn, makes it possible to visualize high dimensional data and may simplify the clustering process. Several data stream clustering algorithms exist in the literature; however, no data stream embedding method exists. Uniform Manifold Approximation and Projection (UMAP) is a data embedding algorithm that is suitable for stationary (stable) data streams, though it cannot adapt to concept drift. In this study, we describe a novel method, EmCStream, that applies UMAP to evolving (nonstationary) data streams, detects and adapts to concept drift, and clusters the embedded data instances using a distance-based or partitioning-based clustering algorithm. We have evaluated EmCStream against state-of-the-art stream clustering algorithms using both synthetic and real data streams containing concept drift. EmCStream outperforms DenStream and CluStream in terms of clustering quality on both synthetic and real evolving data streams. The datasets and code for this study are available online at https://gitlab.com/alaettinzubaroglu/emcstream.
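The abstract outlines a loop of embedding the stream, clustering the embedded instances, and reacting to concept drift, but it does not say how drift is detected or how the embedding is refreshed. The following is a minimal sketch of such a loop under stated assumptions, not the authors' implementation. To stay runnable with only the Python standard library, a toy min-max scaling plus block-averaging map stands in for UMAP, plain k-means stands in for the partitioning-based clusterer, and drift is flagged when the mean squared distance to the nearest cluster center exceeds a multiple of its value at fit time. The names `fit_embedding`, `cluster_stream`, `window`, and `drift_factor` are hypothetical.

```python
import random

def fit_embedding(batch):
    """Toy stand-in for fitting UMAP (assumption): min-max scale each feature
    on this batch, then average each half of the features to get 2-D points.
    Expects at least two features per instance."""
    columns = list(zip(*batch))
    lo = [min(col) for col in columns]
    span = [(max(col) - min(col)) or 1.0 for col in columns]
    half = len(lo) // 2

    def transform(point):
        z = [(x - l) / s for x, l, s in zip(point, lo, span)]
        return (sum(z[:half]) / half, sum(z[half:]) / (len(z) - half))

    return transform

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

def mean_error(points, centers):
    """Mean squared distance from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)

def cluster_stream(stream, k, window=50, drift_factor=5.0):
    """Process the stream in fixed windows; refit embedding and clusters on drift."""
    transform = centers = baseline = None
    drifts = []
    for start in range(0, len(stream) - window + 1, window):
        batch = stream[start:start + window]
        if transform is not None:
            error = mean_error([transform(p) for p in batch], centers)
            if error > drift_factor * baseline:  # assumed drift rule
                drifts.append(start)
                transform = None  # embedding is stale, refit below
        if transform is None:
            transform = fit_embedding(batch)
            embedded = [transform(p) for p in batch]
            centers = kmeans(embedded, k)
            baseline = mean_error(embedded, centers)
    return drifts, centers

def make_stream(seed=0):
    """Synthetic 4-D stream: two clusters that jump by +40 at index 100."""
    rng = random.Random(seed)
    stream = []
    for i in range(200):
        base = (10.0 if i % 2 else 0.0) + (40.0 if i >= 100 else 0.0)
        stream.append([base + rng.uniform(-1, 1) for _ in range(4)])
    return stream

drifts, centers = cluster_stream(make_stream(), k=2)
print("drift flagged at window starts:", drifts)
```

On this synthetic stream the abrupt shift at index 100 pushes the new window far from the fitted centers, so the check should flag that window and trigger a refit. In a real pipeline the stand-in embedding would be replaced by an actual UMAP fit, and the drift rule would need tuning to the data.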