SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications

Proceedings of the ACM on Management of Data. Pub Date: 2023-11-13. DOI: 10.1145/3617338

Shafaq Siddiqi, Roman Kern, Matthias Boehm



Abstract

In the exploratory data science lifecycle, data scientists often spend the majority of their time finding, integrating, validating, and cleaning relevant datasets. Despite recent work on data validation and numerous error detection and correction algorithms, data cleaning for ML remains in practice a largely manual, unpleasant, and labor-intensive trial-and-error process, especially in large-scale, distributed computation. However, the target ML application (such as a classification or regression model) can serve as a valuable feedback signal for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible for user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of full automation, which is rather unrealistic, SAGA simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA yields robust accuracy improvements over the state of the art and scales well with increasing data sizes and numbers of evaluated pipelines.
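The core idea of the abstract, using the downstream ML task as the feedback signal for ranking cleaning pipelines and returning the top-K, can be illustrated with a small sketch. This is not SAGA's implementation or search strategy: the toy data, the three cleaning primitives, the nearest-centroid scorer, and the brute-force enumeration are all assumptions chosen for illustration, and the score is computed on the training data for brevity.

```python
import heapq
from itertools import permutations
from statistics import mean, median

# Hypothetical toy data: one feature with missing values (None) and one outlier.
X = [1.0, 1.2, None, 0.9, 1.3, 5.0, 5.2, None, 4.8, 100.0, 1.1]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Illustrative cleaning primitives (assumed, not taken from the paper).
def impute_mean(col):
    m = mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def impute_median(col):
    m = median(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def clip_outliers(col, lo=0.0, hi=10.0):
    return [v if v is None else min(max(v, lo), hi) for v in col]

PRIMITIVES = {
    "impute_mean": impute_mean,
    "impute_median": impute_median,
    "clip_outliers": clip_outliers,
}

def score(col, labels):
    """Feedback signal: accuracy of a nearest-centroid classifier.

    Rows that are still missing after cleaning count as errors.
    """
    known = [(v, l) for v, l in zip(col, labels) if v is not None]
    c0 = mean(v for v, l in known if l == 0)
    c1 = mean(v for v, l in known if l == 1)
    correct = sum((abs(v - c1) < abs(v - c0)) == bool(l) for v, l in known)
    return correct / len(labels)

def top_k_pipelines(col, labels, k=3, max_len=2):
    """Enumerate all ordered pipelines up to max_len and keep the k best."""
    results = []
    for n in range(1, max_len + 1):
        for names in permutations(PRIMITIVES, n):
            cleaned = col
            for name in names:
                cleaned = PRIMITIVES[name](cleaned)
            results.append((score(cleaned, labels), names))
    return heapq.nlargest(k, results, key=lambda r: r[0])

top = top_k_pipelines(X, y, k=3)
for s, names in top:
    print(f"{s:.3f}  {' -> '.join(names)}")
```

Exhaustive enumeration is only feasible for a handful of primitives; the abstract indicates that SAGA instead borrows ideas from Auto-ML, feature selection, and hyper-parameter tuning, and prunes the search space by properties such as monotonicity.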

