TPE-AutoClust: A Tree-based Pipeline Ensemble Framework for Automated Clustering
Authors: Radwa El Shawi, S. Sakr
Published in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW)
Publication date: 2022-11-01
DOI: 10.1109/ICDMW58026.2022.00149 (https://doi.org/10.1109/ICDMW58026.2022.00149)
Citations: 0
Abstract
Novel technologies in automated machine learning ease the complexity of building well-performing machine learning pipelines. However, these are usually restricted to supervised learning tasks such as classification and regression, while unsupervised learning, particularly clustering, remains a largely unexplored problem due to the ambiguity involved in evaluating clustering solutions. Motivated by this shortcoming, we introduce TPE-AutoClust, a genetic programming-based automated machine learning framework for clustering. TPE-AutoClust optimizes a series of feature preprocessors and machine learning models to maximize performance on an unsupervised clustering task. TPE-AutoClust consists of three main phases: a meta-learning phase, an optimization phase, and a clustering ensemble construction phase. The meta-learning phase suggests pipeline instantiations that are likely to perform well on a new dataset. These pipelines are used to warm-start the optimization phase, which adopts a multi-objective optimization technique to select pipelines based on the Pareto front of the trade-off between pipeline length and performance. The ensemble construction phase develops a collaborative mechanism based on a clustering ensemble to combine pipelines optimized with respect to different internal cluster validity indices and construct a well-performing solution for a new dataset. The proposed framework is based on scikit-learn, with 4 preprocessors and 6 clustering algorithms. Extensive experiments are conducted on 27 real and synthetic benchmark datasets to validate the superiority of TPE-AutoClust. The results show that TPE-AutoClust outperforms state-of-the-art techniques for building automated clustering solutions.
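The Pareto-based selection described in the abstract can be illustrated with a minimal sketch. It assumes each candidate pipeline has already been scored by a single internal validity index where higher is better, and it keeps only the pipelines that no other pipeline beats on both length and score. The `Candidate` type, the step names, and the scores are hypothetical placeholders chosen for illustration; they are not the paper's search space, implementation, or results.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """A hypothetical evaluated pipeline (illustrative, not the paper's data structure)."""
    steps: Tuple[str, ...]  # ordered step names, ending with the clustering algorithm
    score: float            # internal validity index; assumed higher is better

    @property
    def length(self) -> int:
        """Pipeline length, the objective to minimize."""
        return len(self.steps)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.length <= b.length and a.score >= b.score
    strictly_better = a.length < b.length or a.score > b.score
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the non-dominated candidates, preserving input order."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores for illustration only; these are not experimental results.
candidates = [
    Candidate(("KMeans",), 0.41),
    Candidate(("StandardScaler", "KMeans"), 0.55),
    Candidate(("StandardScaler", "PCA", "KMeans"), 0.55),  # same score, longer
    Candidate(("PCA", "DBSCAN"), 0.48),                    # same length, lower score
    Candidate(("MinMaxScaler", "PCA", "AgglomerativeClustering"), 0.62),
]

front = pareto_front(candidates)
for c in front:
    print(c.length, c.score, " -> ".join(c.steps))
```

In this sketch a longer pipeline survives only if it buys a strictly better score, which is one simple way to express the length-versus-performance trade-off; the framework itself may define its objectives and tie-breaking differently.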