TPE-AutoClust: A Tree-based Pipeline Ensemble Framework for Automated Clustering

2022 IEEE International Conference on Data Mining Workshops (ICDMW). Pub Date: 2022-11-01. DOI: 10.1109/ICDMW58026.2022.00149

Radwa El Shawi, S. Sakr


Abstract

Novel technologies in automated machine learning ease the complexity of building well-performing machine learning pipelines. However, these are usually restricted to supervised learning tasks such as classification and regression, while unsupervised learning, particularly clustering, remains largely unexplored because of the ambiguity involved in evaluating clustering solutions. Motivated by this shortcoming, we introduce TPE-AutoClust, a genetic programming-based automated machine learning framework for clustering. TPE-AutoClust optimizes a series of feature preprocessors and machine learning models to maximize performance on an unsupervised clustering task. It consists of three main phases: a meta-learning phase, an optimization phase, and a clustering ensemble construction phase. The meta-learning phase suggests pipeline instantiations that are likely to perform well on a new dataset. These pipelines are used to warm-start the optimization phase, which adopts a multi-objective optimization technique to select pipelines based on the Pareto front of the trade-off between pipeline length and performance. The ensemble construction phase develops a collaborative mechanism based on a clustering ensemble to combine pipelines optimized for different internal cluster validity indices and construct a well-performing solution for a new dataset. The proposed framework is based on scikit-learn and includes 4 preprocessors and 6 clustering algorithms. Extensive experiments are conducted on 27 real and synthetic benchmark datasets to validate TPE-AutoClust. The results show that TPE-AutoClust outperforms state-of-the-art techniques for building automated clustering solutions.
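The selection criterion of the optimization phase, a Pareto front over pipeline length and performance, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three candidate pipelines, the synthetic dataset, the use of the silhouette score as the internal validity index, and the helper `pareto_front` are all assumptions made for illustration, and the genetic programming search is replaced here by a fixed list of candidates.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def pareto_front(candidates):
    """Keep candidates not dominated on (shorter length, higher score).

    Each candidate is a (name, length, score) tuple. Hypothetical helper,
    not part of scikit-learn or of TPE-AutoClust.
    """
    front = []
    for name, length, score in candidates:
        dominated = any(
            o_len <= length
            and o_score >= score
            and (o_len < length or o_score > score)
            for _, o_len, o_score in candidates
        )
        if not dominated:
            front.append((name, length, score))
    return front


# Synthetic data standing in for a new dataset (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# Hand-written candidate pipelines (assumption); the paper searches this
# space with genetic programming instead of enumerating it.
pipelines = {
    "kmeans": make_pipeline(KMeans(n_clusters=3, n_init=10, random_state=0)),
    "scale+kmeans": make_pipeline(
        StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0)
    ),
    "scale+pca+agglo": make_pipeline(
        StandardScaler(), PCA(n_components=2), AgglomerativeClustering(n_clusters=3)
    ),
}

candidates = []
for name, pipe in pipelines.items():
    labels = pipe.fit_predict(X)
    # Silhouette on the raw features is one possible internal validity
    # index; the abstract does not say which indices the framework uses.
    candidates.append((name, len(pipe.steps), silhouette_score(X, labels)))

for name, length, score in pareto_front(candidates):
    print(f"{name}: length={length}, silhouette={score:.3f}")
```

A pipeline stays on the front only if no other candidate is at least as short and at least as well scored while being strictly better on one of the two objectives, so a longer pipeline survives only when it improves the validity index.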
