Engineering MLOps Pipelines With Data Quality: A Case Study on Tabular Datasets in Kaggle

Matteo Pancini, Matteo Camilli, Giovanni Quattrocchi, Damian Andrew Tamburri

Journal of Software: Evolution and Process, 37(9), published 2025-09-08. DOI: 10.1002/smr.70044
https://onlinelibrary.wiley.com/doi/10.1002/smr.70044
Abstract
Ensuring high-quality data is crucial for the successful deployment of machine learning models and for sustaining the operational pipelines built around them. However, a significant number of practitioners do not currently use data quality checks or measurements as gateways for model construction and operationalization, indicating a need for greater awareness and adoption of such tools. In this study, we propose an approach for automating the process of architecting machine learning pipelines by means of (semi-)automated data quality checks. We focus on tabular data as a representative of the most widely used structured data formats in such pipelines. Our work is based on a subset of metrics that are particularly relevant in machine learning operations (MLOps) pipelines, identified through our engagement with expert MLOps practitioners. From a cohort of similar tools, we selected Deepchecks, a well-known data quality checking tool, to evaluate the quality of datasets collected from Kaggle, a widely used platform for machine learning competitions and data science projects. We also analyze the main features Kaggle uses to rank its datasets and use these features to validate the relevance of our approach. Our results show the potential of automated data quality checks to improve the efficiency and effectiveness of MLOps pipelines and their operation by decreasing the risk of introducing errors and biases into machine learning models in production.
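To illustrate the kind of check the abstract refers to, the following minimal sketch (not taken from the paper) runs Deepchecks' built-in data integrity suite on a tabular dataset loaded with pandas. The CSV file name and the "target" label column are hypothetical placeholders, not artifacts of the study.

```python
# Minimal sketch of a (semi-)automated data quality gate with Deepchecks.
# Assumptions: a local CSV exported from Kaggle and a hypothetical "target"
# label column; neither comes from the paper.
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity

# Load a tabular dataset (hypothetical file name).
df = pd.read_csv("kaggle_dataset.csv")

# Wrap the DataFrame so Deepchecks knows which column is the label.
dataset = Dataset(df, label="target")

# Run the built-in data integrity suite (duplicates, mixed data types,
# conflicting labels, outliers, and similar single-dataset checks).
result = data_integrity().run(dataset)

# Persist a human-readable report; in an MLOps pipeline, the suite result
# could instead act as a gateway that blocks training when checks fail.
result.save_as_html("data_integrity_report.html")
```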