Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning: Latest Publications

dcbench
Sabri Eyuboglu, Bojan Karlas, Christopher Ré, Ce Zhang, James Zou
DOI: 10.1145/3533028.3533310 (https://doi.org/10.1145/3533028.3533310) | Published: 2022-06-12
Abstract: The development workflow for today's AI applications has grown far beyond the standard model training task. This workflow typically consists of various data and model management tasks. It includes a "data cycle" aimed at producing high-quality training data, and a "model cycle" aimed at managing trained models on their way to production. This broadened workflow has opened a space for already emerging tools and systems for AI development. However, as a research community, we are still missing standardized ways to evaluate these tools and systems. In a humble effort to get this wheel turning, we developed dcbench, a benchmark for evaluating systems for data-centric AI development. In this report, we present the main ideas behind dcbench, some benchmark tasks that we included in the initial release, and a short summary of its implementation.
Citations: 12
LLVM code optimisation for automatic differentiation: when forward and reverse mode lead in the same direction
Maximilian E. Schüle, M. Springer, A. Kemper
DOI: 10.1145/3533028.3533302 (https://doi.org/10.1145/3533028.3533302) | Published: 2022-06-12
Abstract: Both forward- and reverse-mode automatic differentiation automatically derive the model function used for gradient descent. Reverse mode calculates all derivatives in one run, whereas forward mode requires rerunning the algorithm for every variable whose derivative is needed. To allow for in-database machine learning, we have integrated automatic differentiation as an SQL operator inside the Umbra database system. To benchmark code generation for GPU, we implement both forward- and reverse-mode automatic differentiation. Inspection of the optimised LLVM code shows that nearly the same machine code is executed after the generated LLVM code has been optimised. Thus, both modes yield similar runtimes but different compilation times.
Citations: 1
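The forward/reverse distinction in this abstract can be sketched outside the database context. The following is a minimal illustration, not the Umbra SQL operator: the example function f(x, y) = x*y + sin(x) and the hand-unrolled backward sweep are our own assumptions. Forward mode needs one pass per input variable (seeding that variable's derivative with 1), while reverse mode yields all partials in a single backward sweep.

```python
import math

# Forward mode: propagate (value, derivative) pairs ("dual numbers").
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dual_sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

def f_forward(x, y):             # f(x, y) = x*y + sin(x), on dual numbers
    return x * y + dual_sin(x)

# One forward pass per input variable, seeding its derivative with 1.
df_dx = f_forward(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot
df_dy = f_forward(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot

# Reverse mode: one forward pass, then one backward sweep (hand-unrolled
# here for this tiny function) accumulates all partials at once.
def f_reverse(x, y):
    out = x * y + math.sin(x)
    grad_x = y + math.cos(x)     # adjoint of x: d(out)/dx
    grad_y = x                   # adjoint of y: d(out)/dy
    return out, grad_x, grad_y

out, gx, gy = f_reverse(2.0, 3.0)
assert abs(gx - df_dx) < 1e-12 and abs(gy - df_dy) < 1e-12
```

Both modes compute the same derivatives, which matches the paper's observation that an optimising compiler can reduce them to nearly the same machine code; the cost difference lies in how many passes (forward) versus how much tape bookkeeping (reverse) each mode needs.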
Towards data-centric what-if analysis for native machine learning pipelines
Stefan Grafberger, Paul Groth, Sebastian Schelter
DOI: 10.1145/3533028.3533303 (https://doi.org/10.1145/3533028.3533303) | Published: 2022-06-12
Abstract: An important task of data scientists is to understand the sensitivity of their models to changes in the data that the models are trained and tested upon. Currently, conducting such data-centric what-if analyses requires significant and costly manual development and testing with the corresponding chance for the introduction of bugs. We discuss the problem of data-centric what-if analysis over whole ML pipelines (including data preparation and feature encoding), propose optimisations that reuse trained models and intermediate data to reduce the runtime of such analysis, and finally conduct preliminary experiments on three complex example pipelines, where our approach reduces the runtime by a factor of up to six.
Citations: 4
Learning-to-learn efficiently with self-learning
Shruti Kunde, Sharod Roy Choudhury, Amey Pandit, Rekha Singhal
DOI: 10.1145/3533028.3533307 (https://doi.org/10.1145/3533028.3533307) | Published: 2022-06-12
Abstract: Digital Twins of industrial process plants enable various what-if and if-what scenarios of the plants' functioning for fault diagnosis and general monitoring in the real world. They do so through machine learning (ML) models built using data from sensors fitted in the plant. Over time, environmental factors cause variations in sensor readings, adversely affecting the quality of the models' predictions. This triggers the self-learning loop, leading to the re-tuning/re-training of models. Reducing the time spent in self-learning is challenging, since multiple models must be trained repeatedly using multiple algorithms, which translates into long training times. We propose a metalearner that recommends the optimal regression algorithm for a model, eliminating the need to train the model with multiple algorithms for every self-learning instance. The metalearner is trained on metafeatures extracted from the data, which makes it application-agnostic. We introduce domain metafeatures, which enhance metalearner prediction accuracy, and propose machine learning and deep learning based approaches for selecting optimal metafeatures. To ensure the relevance of the selected metafeatures, we introduce novel static and dynamic reward functions for dynamic metafeature selection using a Q-Learning based approach. Our metalearning approach speeds up the selection of the optimal regressor among 5 candidate regressors by 5X to 27X over traditional self-learning approaches. The incremental pre-processing approach achieves a speed-up of 25X over the traditional approach. The proposed metalearner achieves an AUC of 0.989, 0.954 and 0.998 for the ML, DL and RL based metafeature selection approaches respectively. We illustrate our findings on 3 datasets from the industrial process domain.
Citations: 1
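The reward-driven metafeature selection mentioned in this abstract can be sketched in simplified form. This is a hypothetical one-step (bandit-style) simplification of Q-learning, not the paper's method: the metafeature names, the reward function, and all hyperparameters below are illustrative assumptions. The idea is that the value of adding a metafeature is learned from the accuracy gain (the reward) it produces.

```python
import random

def q_select(metafeatures, reward_fn, episodes=200, epsilon=0.2, alpha=0.5, seed=0):
    """Epsilon-greedy, single-step tabular value learning over metafeatures."""
    rng = random.Random(seed)
    q = {f: 0.0 for f in metafeatures}           # one Q-value per metafeature
    for _ in range(episodes):
        # Explore a random metafeature with probability epsilon, else exploit.
        f = (rng.choice(metafeatures) if rng.random() < epsilon
             else max(q, key=q.get))
        q[f] += alpha * (reward_fn(f) - q[f])    # tabular update toward reward
    return max(q, key=q.get)

# Toy reward: pretend "skewness" yields the largest validation-accuracy gain.
gains = {"n_rows": 0.01, "skewness": 0.05, "kurtosis": 0.02}
best = q_select(list(gains), lambda f: gains[f])
```

The paper's static and dynamic reward functions, and its sequential state space (sets of already-selected metafeatures), are richer than this single-arm sketch; this only shows the mechanism by which a reward signal steers selection.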
Accelerating container-based deep learning hyperparameter optimization workloads
Rui Liu, David Wong, David J. Lange, Patrik Larsson, Vinay Jethava, Qing Zheng
DOI: 10.1145/3533028.3533309 (https://doi.org/10.1145/3533028.3533309) | Published: 2022-06-12
Abstract: DocuSign is advancing artificial intelligence at a great pace and embracing a continuous shift towards developing and deploying an increasing number of deep learning models. During the development stage, developers usually build a number of deep learning models and train them under many potential hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time due to ever-larger models and numerous hyperparameter configurations. Furthermore, the HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The workload consists of long-running, containerized HPO jobs that can rapidly saturate the current machine learning infrastructure at DocuSign, yet key resources (e.g., GPU memory or compute units) are not always fully utilized; for example, some hyperparameter configurations may take only a fraction of the GPU memory but still occupy the entire device due to containerization. Because of this issue, users may have to wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments, so that multiple containerized segmented jobs can be executed in parallel. We evaluate Relish on an HPO workload based on a three-month trace from a multi-tenant GPU cluster of a research and development team at DocuSign; the results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.
Citations: 2
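The utilization problem motivating this paper can be illustrated with a toy first-fit packing of trials onto shared GPUs. This is not Relish itself (which segments jobs and schedules containers); the memory fractions and the first-fit policy below are our own illustrative assumptions.

```python
def pack_trials(mem_fractions, gpu_capacity=1.0):
    """First-fit: place each trial on the first GPU with enough free memory."""
    gpus = []                         # remaining capacity per GPU
    placement = []                    # GPU index assigned to each trial
    for frac in mem_fractions:
        for i, free in enumerate(gpus):
            if frac <= free + 1e-9:
                gpus[i] -= frac
                placement.append(i)
                break
        else:                         # no existing GPU fits: allocate a new one
            gpus.append(gpu_capacity - frac)
            placement.append(len(gpus) - 1)
    return placement, len(gpus)

# Six trials that would occupy six whole devices under one-container-per-GPU
# fit on two GPUs when their memory fractions are respected.
placement, n_gpus = pack_trials([0.5, 0.25, 0.25, 0.4, 0.3, 0.3])
```

This only captures the "fraction of the GPU occupies the entire device" waste; the actual system must also handle isolation, interference, and dynamic job arrival.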
Evaluating model serving strategies over streaming data
Sonia-Florina Horchidan, Emmanouil Kritharakis, Vasiliki Kalavri, Paris Carbone
DOI: 10.1145/3533028.3533308 (https://doi.org/10.1145/3533028.3533308) | Published: 2022-06-12
Abstract: We present the first performance evaluation study of model serving integration tools in stream processing frameworks. Using Apache Flink as a representative stream processing system, we evaluate alternative Deep Learning serving pipelines for image classification. Our performance evaluation considers both the case of embedded use of Machine Learning libraries within stream tasks and that of external serving via Remote Procedure Calls. The results indicate superior throughput and scalability for pipelines that make use of embedded libraries to serve pre-trained models. Latency, however, can vary across strategies: external serving can even achieve lower latency when network conditions are optimal, due to better specialised use of the underlying hardware. We discuss our findings and provide further motivating arguments towards future research in the area of ML-native data streaming engines.
Citations: 3
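The two strategies compared in this abstract can be contrasted structurally in a framework-agnostic sketch (this is not the paper's Flink code; the stand-in model, event schema, and in-process "RPC server" are illustrative assumptions). Embedded serving calls the model in-process per event; external serving serializes each event and crosses a process boundary, simulated here by a local call on a JSON payload.

```python
import json

def model(features):               # stand-in for a pre-trained classifier
    return 1 if sum(features) > 0 else 0

# Strategy 1: embedded -- the model lives inside the stream operator.
def embedded_operator(event):
    return model(event["features"])

# Strategy 2: external -- each event is (de)serialized for an RPC; the
# network hop is simulated by a local call on the serialized payload.
def rpc_server(payload):
    request = json.loads(payload)
    return json.dumps({"label": model(request["features"])})

def external_operator(event):
    response = rpc_server(json.dumps({"features": event["features"]}))
    return json.loads(response)["label"]

stream = [{"features": [0.5, -0.1]}, {"features": [-2.0, 0.3]}]
labels = [embedded_operator(e) for e in stream]
assert labels == [external_operator(e) for e in stream]
```

The per-event serialization and boundary crossing visible in the external path is the structural source of the throughput gap the paper measures, while dedicated serving hardware can offset it on the latency side.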
How I stopped worrying about training data bugs and started complaining
Lampros Flokas, Weiyuan Wu, Jiannan Wang, Nakul Verma, Eugene Wu
DOI: 10.1145/3533028.3533305 (https://doi.org/10.1145/3533028.3533305) | Published: 2022-06-12
Abstract: There is an increasing awareness of the gap between machine learning research and production. The research community has largely focused on developing a model that performs well on a validation set, but the production environment needs to make sure the model also performs well in a downstream application. The latter is more challenging because the test/inference-time data used in the application could be quite different from the training data. To address this challenge, we advocate for "complaint-driven" data debugging, which allows the user to complain about the unexpected behaviors of the model in the downstream application, and which proposes interventions for the training data errors that likely led to the complaints. This new debugging paradigm helps solve a range of training data quality problems such as labeling error, fairness, and data drift. We present our long-term vision, highlight achieved milestones, and outline a research roadmap including a number of open problems.
Citations: 0
GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines
Valerie Restat, Gerrit Boerner, Andrew P. Conrad, U. Störl
DOI: 10.1145/3533028.3533311 (https://doi.org/10.1145/3533028.3533311) | Published: 2022-06-12
Abstract: Data preparation is necessary to ensure data quality in machine learning-based decisions and data-driven systems. A variety of different tools exist to simplify this process. However, there is often a lack of suitable data sets to evaluate and compare existing tools and new research approaches. For this reason, we implemented GouDa, a tool for generating universal data sets. GouDa can be used to create data sets with arbitrary error types at arbitrary error rates. In addition to the data sets with automatically generated errors, ground truth is provided. Thus, GouDa can be used for the extensive analysis and evaluation of data preparation pipelines.
Citations: 1
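The core idea in this abstract, injecting a chosen error type at a chosen rate while keeping the clean copy as ground truth, can be sketched for a single error type. This is our own minimal illustration, not GouDa's actual API: the function name, table layout, and missing-values error type are assumptions.

```python
import copy
import random

def inject_missing_values(rows, column, error_rate, seed=0):
    """Set `column` to None in roughly `error_rate` of rows.

    Returns (dirty, truth): the corrupted copy and the untouched ground
    truth, so a data preparation pipeline's repairs can be scored exactly.
    """
    rng = random.Random(seed)                # fixed seed: reproducible errors
    truth = copy.deepcopy(rows)
    dirty = copy.deepcopy(rows)
    for row in dirty:
        if rng.random() < error_rate:
            row[column] = None
    return dirty, truth

clean = [{"id": i, "price": float(i)} for i in range(1000)]
dirty, truth = inject_missing_values(clean, "price", error_rate=0.2)
missing = sum(1 for r in dirty if r["price"] is None)
```

GouDa generalises this across many error types (and their combinations) at arbitrary rates; having `truth` alongside `dirty` is what makes objective pipeline evaluation possible.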
Minun
Jin Wang, Yuliang Li
DOI: 10.1145/3533028.3533304 (https://doi.org/10.1145/3533028.3533304) | Published: 2022-06-12
Abstract: Entity Matching (EM) is an important problem in data integration and cleaning. More recently, deep learning techniques, especially pre-trained language models, have been integrated into EM applications and achieved promising results. Unfortunately, the significant performance gain comes with the loss of explainability and transparency, deterring EM from the requirement of responsible data management. To address this issue, recent studies extended explainable AI techniques to explain black-box EM models. However, these solutions have major drawbacks: (i) their explanations do not capture the unique semantic characteristics of the EM problem; and (ii) they fail to provide an objective method to quantitatively evaluate the provided explanations. In this paper, we propose Minun, a model-agnostic method to generate explanations for EM solutions. We utilize counterfactual examples generated from an EM-customized search space as the explanations and develop two search algorithms to efficiently find such results. We also propose a novel evaluation framework based on a student-teacher paradigm. The framework enables the evaluation of explanations of diverse formats by capturing the performance gain of a "student" model at simulating the target "teacher" model when explanations are given as side input. We conduct an extensive set of experiments on explaining state-of-the-art deep EM models on popular EM benchmark datasets. The results demonstrate that Minun significantly outperforms popular explainable AI methods such as LIME and SHAP on both explanation quality and scalability.
Citations: 2
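The counterfactual idea in this abstract can be sketched with a toy matcher. This is in the spirit of Minun but is not its algorithm: the Jaccard matcher, the naive last-token-first removal order, and all names below are illustrative assumptions. The tokens whose removal flips the matcher's decision serve as the explanation.

```python
def jaccard_match(left, right, threshold=0.5):
    """Toy entity matcher: token-set Jaccard similarity against a threshold."""
    a, b = set(left.split()), set(right.split())
    return len(a & b) / len(a | b) >= threshold

def counterfactual_tokens(left, right, matcher=jaccard_match):
    """Drop tokens from `left` until the match decision flips.

    Returns the dropped tokens (the counterfactual explanation), or None
    if no token subset flips the decision. Minun instead searches an
    EM-customized edit space with efficient algorithms; this greedy
    last-first order is deliberately naive.
    """
    original = matcher(left, right)
    tokens = left.split()
    removed = []
    while tokens and matcher(" ".join(tokens), right) == original:
        removed.append(tokens.pop())
    flipped = matcher(" ".join(tokens), right) != original
    return removed if flipped else None

explanation = counterfactual_tokens("apple iphone 12 pro", "apple iphone 12")
```

A minimal such edit set tells the user which tokens the matcher's decision actually hinges on, which is the EM-specific semantics that generic attribution methods like LIME and SHAP miss.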