Extending k-Means-Based Algorithms for Evolving Data Streams with Variable Number of Clusters

2011 10th International Conference on Machine Learning and Applications and Workshops | Pub Date: 2011-12-18 | DOI: 10.1109/ICMLA.2011.67

J. Silva, Eduardo R. Hruschka


Citations: 20

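Abstract

Many algorithms for clustering data streams based on the widely used k-Means have been proposed in the literature. Most of them assume that the number of clusters, k, is known and fixed a priori by the user. To relax this assumption, which is often unrealistic in practical applications, we describe an algorithmic framework that allows k to be estimated automatically from the data. We illustrate the potential of the proposed framework by using three state-of-the-art algorithms for clustering data streams - Stream LSearch, CluStream, and Stream KM++ - combined with two well-known algorithms for estimating the number of clusters, namely Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). As an additional contribution, we experimentally compare the resulting algorithmic instantiations on both synthetic and real-world data streams. Analyses of statistical significance suggest that OMRk yields the best data partitions, while BkM is more computationally efficient. Also, the combination of Stream KM++ with OMRk leads to the best trade-off between accuracy and efficiency.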
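The abstract names OMRk but does not spell out the procedure. As a rough illustration only, the sketch below assumes that OMRk amounts to running k-means several times for each k in an ordered range and keeping the partition with the best validity score; the simplified silhouette used here, the function names (`kmeans`, `simplified_silhouette`, `estimate_k`), and the parameters `k_max`, `runs`, and `seed` are all assumptions made for this example, not details taken from the paper. Points are expected to be tuples of coordinates.

```python
import math
import random


def kmeans(points, k, rng, max_iters=50):
    """Lloyd's k-means with random initial centroids; returns (centroids, labels)."""
    centroids = rng.sample(points, k)

    def assign(cs):
        # Index of the nearest centroid for every point.
        return [min(range(k), key=lambda c: math.dist(p, cs[c])) for p in points]

    for _ in range(max_iters):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(centroids)


def simplified_silhouette(points, centroids, labels):
    """Mean of (b - a) / max(a, b), using distances to centroids only."""
    total = 0.0
    for p, lab in zip(points, labels):
        a = math.dist(p, centroids[lab])  # distance to own centroid
        b = min(math.dist(p, c) for i, c in enumerate(centroids) if i != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def estimate_k(points, k_max, runs=10, seed=0):
    """Assumed OMRk-style search: try k = 2..k_max, several runs each, keep the best."""
    rng = random.Random(seed)
    best_score, best_k, best_centroids = -1.0, None, None
    for k in range(2, k_max + 1):
        for _ in range(runs):
            centroids, labels = kmeans(points, k, rng)
            score = simplified_silhouette(points, centroids, labels)
            if score > best_score:  # strict '>' keeps the smaller k on ties
                best_score, best_k, best_centroids = score, k, centroids
    return best_k, best_centroids
```

Calling `estimate_k(points, k_max=6)` on a list of coordinate tuples returns the selected k together with its centroids. In a streaming setting, such a search would presumably run on whatever compact summary the stream algorithm maintains rather than on the raw stream, but that wiring is not shown here.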


