MetaPrep: Data preparation pipelines recommendation via meta-learning

2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA) Pub Date : 2021-12-01 DOI:10.1109/ICMLA52953.2021.00194

F. Zagatti, L. C. Silva, Lucas Nildaimon Dos Santos Silva, B. S. Sette, Helena de Medeiros Caseli, D. Lucrédio, D. F. Silva

Pages: 1197-1202


Abstract

Data preparation is a mandatory phase in the machine learning pipeline. Its goal is to convert noisy and disordered data into refined data that learning algorithms can use. However, data preparation is time-consuming and requires specialized knowledge about both the data and the algorithms. Automating it is therefore essential to reduce the effort data scientists spend developing satisfactory models. Despite its relevance, current AutoML platforms either disregard data preparation or apply simple hardcoded pipelines. To fill this gap, we present a meta-learning-based recommendation system for data preparation. Our system recommends five pipelines, ranked by relevance, which makes it useful for users with varying degrees of experience. Using the top-1 pipeline, we demonstrated that our proposal improves the performance of an AutoML system. Furthermore, the accuracy of our method was comparable to that of a reinforcement-learning-based algorithm with the same goal, while running up to two orders of magnitude faster. Finally, we tested our method in a real-world application and evaluated its benefits and limitations in that scenario.
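The abstract does not say how the recommender is built, so the following is only a minimal sketch of one common way to do meta-learning-based ranking, not the authors' method. It assumes a nearest-neighbor meta-learner: a new dataset is described by a meta-feature vector, the most similar past datasets are looked up in a meta-base, and candidate pipelines are ranked by their average past score on those neighbors. The meta-base contents, pipeline names, meta-feature values, and the `recommend` function are all hypothetical.

```python
import math

# Hypothetical meta-base: for each past dataset, a meta-feature vector and the
# score every candidate preparation pipeline reached on it. Illustrative only.
META_BASE = [
    {"meta": [0.1, 0.9],
     "scores": {"impute+scale": 0.90, "impute+onehot+scale": 0.80, "onehot": 0.70,
                "scale": 0.60, "drop_missing": 0.50, "none": 0.40}},
    {"meta": [0.2, 0.8],
     "scores": {"impute+scale": 0.80, "impute+onehot+scale": 0.95, "onehot": 0.70,
                "scale": 0.60, "drop_missing": 0.50, "none": 0.40}},
    {"meta": [0.9, 0.1],
     "scores": {"impute+scale": 0.40, "impute+onehot+scale": 0.45, "onehot": 0.50,
                "scale": 0.60, "drop_missing": 0.70, "none": 0.95}},
]


def distance(a, b):
    """Euclidean distance between two meta-feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recommend(meta_features, meta_base, k_neighbors=2, top_n=5):
    """Return the top_n pipeline names, ranked by mean score on the nearest past datasets."""
    # Assumed similarity rule: the k closest datasets in meta-feature space.
    neighbors = sorted(
        meta_base, key=lambda entry: distance(meta_features, entry["meta"])
    )[:k_neighbors]

    # Collect each pipeline's scores across the selected neighbors.
    collected = {}
    for entry in neighbors:
        for pipeline, score in entry["scores"].items():
            collected.setdefault(pipeline, []).append(score)

    # Rank by mean score, highest first; ties are broken by name for determinism.
    mean_score = {p: sum(s) / len(s) for p, s in collected.items()}
    return sorted(mean_score, key=lambda p: (-mean_score[p], p))[:top_n]


if __name__ == "__main__":
    for rank, name in enumerate(recommend([0.15, 0.85], META_BASE), start=1):
        print(rank, name)
```

Returning a ranked list of five rather than a single pipeline mirrors the abstract's point that users with different levels of experience can either accept the top-1 suggestion or inspect the alternatives.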
