{"title":"具有可变簇数的数据流聚类支持系统","authors":"J. Silva, Eduardo R. Hruschka","doi":"10.1145/2932704","DOIUrl":null,"url":null,"abstract":"Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. Aimed at relaxing this assumption, which is often unrealistic in practical applications, we propose a support system that allows not only estimating the number of clusters automatically from data but also monitoring the process of the data-stream clustering. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++−BkM. These algorithms are combinations of three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters, which are Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms using both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms that are based on OMRk yield the best data partitions, while the algorithms that are based on BkM are more computationally efficient. Additionally, StreamKM++−OMRk and Stream LSearch-IBkM provide the best tradeoff relationship between accuracy and efficiency.","PeriodicalId":377078,"journal":{"name":"ACM Transactions on Autonomous and Adaptive Systems (TAAS)","volume":"s3-46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"A Support System for Clustering Data Streams with a Variable Number of Clusters\",\"authors\":\"J. Silva, Eduardo R. Hruschka\",\"doi\":\"10.1145/2932704\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. Aimed at relaxing this assumption, which is often unrealistic in practical applications, we propose a support system that allows not only estimating the number of clusters automatically from data but also monitoring the process of the data-stream clustering. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++−BkM. These algorithms are combinations of three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters, which are Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms using both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms that are based on OMRk yield the best data partitions, while the algorithms that are based on BkM are more computationally efficient. Additionally, StreamKM++−OMRk and Stream LSearch-IBkM provide the best tradeoff relationship between accuracy and efficiency.\",\"PeriodicalId\":377078,\"journal\":{\"name\":\"ACM Transactions on Autonomous and Adaptive Systems (TAAS)\",\"volume\":\"s3-46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Autonomous and Adaptive Systems (TAAS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2932704\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Autonomous and Adaptive Systems (TAAS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2932704","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7
摘要
文献中已经提出了许多基于广泛使用的k-Means的数据流聚类算法。这些算法中的大多数都假设簇的数量k是已知的,并且是用户先验地固定的。为了打破这种在实际应用中往往不现实的假设,我们提出了一个支持系统,它不仅可以从数据中自动估计聚类的数量,而且可以监控数据流聚类的过程。我们通过一个原型来说明该系统的潜力,该原型实现了8种数据流聚类算法,即Stream LSearch-OMRk、Stream LSearch-BkM、Stream LSearch-IOMRk、Stream LSearch-IBkM、CluStream-OMRk、CluStream-BkM、streamk++ -OMRk和streamk++ -BkM。这些算法结合了三种最先进的算法,用于对固定k的数据流进行聚类,即Stream LSearch、CluStream和streamk++,以及两种估计聚类数量的算法,即有序多次运行k- means (OMRk)和平分k- means (BkM)。我们通过实验比较了这些算法使用合成和真实数据流的性能。统计显著性分析表明,基于OMRk的算法产生最好的数据分区,而基于BkM的算法计算效率更高。此外,streamk++−OMRk和Stream LSearch-IBkM提供了精度和效率之间的最佳权衡关系。
A Support System for Clustering Data Streams with a Variable Number of Clusters
Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. Aimed at relaxing this assumption, which is often unrealistic in practical applications, we propose a support system that allows not only estimating the number of clusters automatically from data but also monitoring the process of the data-stream clustering. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++−BkM. These algorithms are combinations of three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters, which are Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms using both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms that are based on OMRk yield the best data partitions, while the algorithms that are based on BkM are more computationally efficient. Additionally, StreamKM++−OMRk and Stream LSearch-IBkM provide the best tradeoff relationship between accuracy and efficiency.