O-Cluster: scalable clustering of large high dimensional data sets
B. Milenova, M. Campos
2002 IEEE International Conference on Data Mining, 2002. Proceedings. Published 2002-12-09.
DOI: 10.1109/ICDM.2002.1183915 (https://doi.org/10.1109/ICDM.2002.1183915)
Citations: 82
Abstract
Clustering large, high-dimensional data sets has always been a challenge for clustering algorithms. Many recently developed clustering algorithms have attempted to handle data sets with a very large number of records, a very high number of dimensions, or both. We discuss the advantages and limitations of existing algorithms when they operate on very large multidimensional data sets. To simultaneously overcome both the "curse of dimensionality" and the scalability problems associated with large amounts of data, we propose a new clustering algorithm called O-Cluster. O-Cluster combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their robustness to noise, and O-Cluster's excellent scalability.
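The abstract does not spell out how the axis-parallel partitioning works, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that a split is placed at a low-density "valley" between two denser regions of a one-dimensional histogram along a single axis. The function names (`histogram`, `deepest_valley`, `find_axis_parallel_split`), the equal-width binning, and the depth score are all hypothetical choices made for illustration; the active sampling step and the limited memory buffer described above are not modeled.

```python
from typing import List, Optional, Sequence, Tuple


def histogram(values: Sequence[float], bins: int, lo: float, hi: float) -> List[int]:
    """Count values into `bins` equal-width bins covering [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the index so that v == hi falls into the last bin.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def deepest_valley(counts: Sequence[int]) -> Tuple[Optional[int], int]:
    """Return (bin index, depth) of the deepest valley, or (None, 0).

    Depth (an illustrative score, not taken from the paper) is how far a bin
    lies below the lower of the tallest bins to its left and to its right.
    """
    best_idx, best_depth = None, 0
    for i in range(1, len(counts) - 1):
        depth = min(max(counts[:i]), max(counts[i + 1:])) - counts[i]
        if depth > best_depth:
            best_idx, best_depth = i, depth
    return best_idx, best_depth


def find_axis_parallel_split(
    points: Sequence[Sequence[float]], bins: int = 10
) -> Optional[Tuple[int, float]]:
    """Return (dimension, cut value) for the deepest valley over all axes.

    Expects a non-empty list of equal-length points. Returns None when no
    axis shows a valley between two denser regions.
    """
    best, best_depth = None, 0
    for dim in range(len(points[0])):
        values = [p[dim] for p in points]
        lo, hi = min(values), max(values)
        if hi == lo:
            continue  # A constant axis cannot separate anything.
        idx, depth = deepest_valley(histogram(values, bins, lo, hi))
        if idx is not None and depth > best_depth:
            width = (hi - lo) / bins
            # Cut through the centre of the valley bin.
            best, best_depth = (dim, lo + (idx + 0.5) * width), depth
    return best


if __name__ == "__main__":
    # Two dense groups separated along axis 0 only.
    left = [(0.1 * k, float(k % 3)) for k in range(10)]
    right = [(9.0 + 0.1 * k, float(k % 3)) for k in range(10)]
    print(find_axis_parallel_split(left + right))
```

Because every candidate cut is perpendicular to a single axis, each split only needs one-dimensional counts per dimension, which is one plausible reason such strategies are attractive for high-dimensional data. A real implementation would also need a criterion for deciding whether a valley is significant rather than noise, which this sketch leaves out.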