Proactively Screening Machine Learning Pipelines with ARGUSEYES

Companion of the 2023 International Conference on Management of Data. Pub Date: 2023-06-04. DOI: 10.1145/3555041.3589682

Sebastian Schelter, Stefan Grafberger, Shubha Guha, Bojan Karlas, Ce Zhang


Citations: 2

Abstract

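Software systems that learn from data with machine learning (ML) are ubiquitous. ML pipelines in these applications often suffer from a variety of data-related issues, such as data leakage, label errors or fairness violations, which require reasoning about complex dependencies between their inputs and outputs. These issues are usually detected only in hindsight, after deployment and after they have caused harm in production. We demonstrate ArgusEyes, a system that enables data scientists to proactively screen their ML pipelines for data-related issues as part of continuous integration. ArgusEyes instruments, executes and screens ML pipelines for declaratively specified pipeline issues, and analyzes data artifacts and their provenance to catch potential problems early, before deployment to production. We demonstrate our system in three scenarios: detecting mislabeled images in a computer vision pipeline, spotting data leakage in a price prediction pipeline, and addressing fairness violations in a credit scoring pipeline.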
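To make the idea of declaratively specified pipeline issues more concrete, the following is a minimal sketch of how such screening might look. It is not the ArgusEyes API: the names `PipelineArtifacts`, `Issue`, `screen` and `ISSUES`, the use of row identifiers as provenance, and the 0.2 threshold on the positive-prediction-rate gap are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class PipelineArtifacts:
    """Hypothetical record of what one instrumented pipeline run captured."""
    train_row_ids: Set[int]      # provenance: source rows that reached the training set
    test_row_ids: Set[int]       # provenance: source rows that reached the test set
    test_predictions: List[int]  # binary model predictions on the test set
    test_groups: List[str]       # sensitive group of each test record


@dataclass
class Issue:
    """A declared issue: a name plus a predicate that is True when the issue is present."""
    name: str
    detected: Callable[[PipelineArtifacts], bool]


def has_train_test_overlap(a: PipelineArtifacts) -> bool:
    # Leakage check (assumed form): the same source row feeds both train and test data.
    return bool(a.train_row_ids & a.test_row_ids)


def positive_rate_gap(a: PipelineArtifacts) -> float:
    # Largest difference in positive-prediction rate between any two groups.
    rates = []
    for group in set(a.test_groups):
        preds = [p for p, g in zip(a.test_predictions, a.test_groups) if g == group]
        rates.append(sum(preds) / len(preds))
    return max(rates) - min(rates)


def screen(artifacts: PipelineArtifacts, issues: List[Issue]) -> List[str]:
    # Return the names of all declared issues that are detected, in declaration order.
    return [issue.name for issue in issues if issue.detected(artifacts)]


# Declarative list of issues to screen for; the 0.2 threshold is an arbitrary assumption.
ISSUES = [
    Issue("data_leakage", has_train_test_overlap),
    Issue("fairness_violation", lambda a: positive_rate_gap(a) > 0.2),
]
```

In a continuous integration job, a script along these lines could call `screen(artifacts, ISSUES)` after running the pipeline and fail the build whenever the returned list is non-empty.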



