Automating data preparation pipeline efficiently via Monte Carlo tree search
Yiyi Zhang, Ning Wang, Qixiong Zeng, Liangwei Li
Information Sciences, volume 724, Article 122730. Journal article, published 2025-09-25. DOI: 10.1016/j.ins.2025.122730
https://www.sciencedirect.com/science/article/pii/S0020025525008667
Abstract
Data preparation is a crucial step in machine learning and the most time- and energy-consuming task for data scientists, entailing a number of data processing techniques that improve the performance of ML models. However, end-to-end AutoML research focuses on automated machine learning pipelines consisting of algorithm selection and hyperparameter tuning, and falls short of comprehensively automating data preparation. In this paper, we propose Auto-DP, an MCTS-based framework for efficient, automated data preparation. To guide the search more effectively, a neural network is designed to estimate the maximum subsequent performance gain of each tree node. To reduce the search space and improve system efficiency, two optimization strategies, meta-learning and an accelerated training strategy, are used to determine the type and order of data preparation tasks in advance and to speed up pipeline creation. We compare Auto-DP with popular AutoML systems on 60 real datasets from the OpenML repository. Auto-DP improves accuracy by up to 18.11% on the classification task and reduces MSE by up to 25.75% on the regression task. Furthermore, in 10 s it achieves better performance than four popular AutoML systems achieve in 1 h.
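To make the idea of an MCTS search over data preparation pipelines concrete, the following is a minimal sketch and not the Auto-DP implementation. Everything in it is an assumption for illustration: the operator names, the `MAX_DEPTH` cap, and the made-up `toy_score` function, which stands in for training a model on the prepared data. Plain random rollouts with UCT are used where the abstract describes a neural network estimating the maximum subsequent performance gain, and the meta-learning and accelerated training strategies are not modeled.

```python
import math
import random

# Hypothetical operator names; the abstract does not list Auto-DP's operators.
OPERATORS = ["impute_mean", "scale_standard", "encode_onehot", "select_features"]
MAX_DEPTH = 3  # assumed cap on pipeline length


def toy_score(pipeline):
    """Made-up stand-in for 'train a model on prepared data and score it'."""
    score = 0.5
    if "impute_mean" in pipeline:
        score += 0.2
        # Reward imputing before scaling.
        if ("scale_standard" in pipeline
                and pipeline.index("impute_mean") < pipeline.index("scale_standard")):
            score += 0.2
    # Penalise repeated operators.
    score -= 0.1 * (len(pipeline) - len(set(pipeline)))
    return score


class Node:
    def __init__(self, pipeline, parent=None):
        self.pipeline = pipeline   # tuple of operator names applied so far
        self.parent = parent
        self.children = {}         # operator name -> child Node
        self.visits = 0
        self.total = 0.0           # sum of rewards backed up through this node

    def is_terminal(self):
        return len(self.pipeline) >= MAX_DEPTH

    def untried(self):
        return [op for op in OPERATORS if op not in self.children]


def uct_child(node, c=1.4):
    """Pick the child maximising mean reward plus an exploration bonus."""
    return max(
        node.children.values(),
        key=lambda ch: ch.total / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def search(iterations=500, seed=0):
    rng = random.Random(seed)
    root = Node(())
    best_pipeline, best_score = (), toy_score(())
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded.
        while not node.is_terminal() and not node.untried():
            node = uct_child(node)
        # Expansion: add one untried operator.
        if not node.is_terminal():
            op = rng.choice(node.untried())
            node.children[op] = Node(node.pipeline + (op,), parent=node)
            node = node.children[op]
        # Simulation: complete the pipeline with random operators.
        rollout = node.pipeline
        while len(rollout) < MAX_DEPTH:
            rollout = rollout + (rng.choice(OPERATORS),)
        reward = toy_score(rollout)
        if reward > best_score:
            best_pipeline, best_score = rollout, reward
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    return best_pipeline, best_score


if __name__ == "__main__":
    print(search())
```

In a real system the expensive step is the evaluation that `toy_score` replaces, which is why a learned estimate of each node's potential gain, as the abstract describes, can reduce the number of model trainings the search needs.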
About the journal:
Informatics and Computer Science Intelligent Systems Applications is an esteemed international journal that focuses on publishing original and creative research findings in the field of information sciences. We also feature a limited number of timely tutorial and surveying contributions.
Our journal aims to cater to a diverse audience, including researchers, developers, managers, strategic planners, graduate students, and anyone interested in staying up-to-date with cutting-edge research in information science, knowledge engineering, and intelligent systems. While readers are expected to share a common interest in information science, they come from varying backgrounds such as engineering, mathematics, statistics, physics, computer science, cell biology, molecular biology, management science, cognitive science, neurobiology, behavioral sciences, and biochemistry.