Huanliang Sun, Ge Yu, Y. Bao, Faxin Zhao, Daling Wang
15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications (RIDE-SDMA'05), published 2005-04-03. DOI: 10.1109/RIDE.2005.8 (https://doi.org/10.1109/RIDE.2005.8)
CDS-Tree: an effective index for clustering arbitrary shapes in data streams
Finding clusters of arbitrary shape in data streams is a challenging task for advanced applications. An effective approach to clustering arbitrary shapes is space-partition-based clustering. However, it cannot be applied directly to data streams, because it consumes a large amount of memory while data stream processing operates under strict memory limits. In addition, it is inefficient for high-dimensional data and fine partition granularity. Moreover, its fixed partition granularity does not suit the changing data distribution of a data stream. We therefore propose a novel index structure, CDS-Tree, and design an improved space-partition-based clustering algorithm that aims to cluster arbitrary shapes in high-dimensional stream data with high accuracy. CDS-Tree stores only non-empty cells and preserves the positional relationships among cells, so its compact structure uses little memory and is efficient. We also propose a novel measure of data skew, the Data Skew Factor (DSF), which is used to automatically adjust the partition granularity as the data stream changes, so that the algorithm achieves high analysis accuracy within limited memory. Experimental results on real and synthetic datasets show that the algorithm achieves higher clustering accuracy, and scales better with window size and data dimensionality, than other typical algorithms applied in a straightforward manner.
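To make the idea of space-partition clustering over non-empty cells concrete, here is a minimal Python sketch. It is not the CDS-Tree or the authors' algorithm: it assumes a uniform grid with a fixed cell width, keeps only non-empty cells in a hash map, and joins adjacent dense cells into clusters. The names `cell_of`, `grid_clusters`, `width`, and `min_pts` are hypothetical, and the DSF-driven granularity adjustment is not modeled because the abstract does not define it.

```python
from collections import Counter, deque
from itertools import product


def cell_of(point, width):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(x // width) for x in point)


def grid_clusters(points, width, min_pts=1):
    """Label dense grid cells so that adjacent dense cells share a cluster label.

    Illustrative sketch only: a plain dict of non-empty cells stands in for
    an index structure; this is an assumption, not the paper's CDS-Tree.
    """
    # Only cells that actually contain points are stored.
    counts = Counter(cell_of(p, width) for p in points)
    dense = {c for c, n in counts.items() if n >= min_pts}
    dims = len(next(iter(dense))) if dense else 0
    # All neighbor offsets in {-1, 0, 1}^dims except the zero offset.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if any(o)]

    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        # Breadth-first search over adjacent dense cells.
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for off in offsets:
                nb = tuple(c + d for c, d in zip(cell, off))
                if nb in dense and nb not in labels:
                    labels[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return labels


points = [(0.1, 0.1), (0.9, 0.2), (1.5, 0.5), (5.2, 5.1), (6.1, 5.9)]
print(grid_clusters(points, width=1.0))
```

Because clusters are formed from chains of adjacent cells, their shape is not restricted to spheres or boxes. Note that this naive neighbor scan visits 3^d - 1 offsets per cell in d dimensions, which illustrates why a plain grid becomes inefficient for high-dimensional data.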