Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning: Latest Publications

dcbench
Sabri Eyuboglu, Bojan Karlas, Christopher Ré, Ce Zhang, James Zou
DOI: 10.1145/3533028.3533310 (https://doi.org/10.1145/3533028.3533310) | Published: 2022-06-12
Abstract: The development workflow for today's AI applications has grown far beyond the standard model training task. This workflow typically consists of various data and model management tasks. It includes a "data cycle" aimed at producing high-quality training data, and a "model cycle" aimed at managing trained models on their way to production. This broadened workflow has opened a space for already emerging tools and systems for AI development. However, as a research community, we are still missing standardized ways to evaluate these tools and systems. In a humble effort to get this wheel turning, we developed dcbench, a benchmark for evaluating systems for data-centric AI development. In this report, we present the main ideas behind dcbench, some benchmark tasks that we included in the initial release, and a short summary of its implementation.
Citations: 12
LLVM code optimisation for automatic differentiation: when forward and reverse mode lead in the same direction
Maximilian E. Schüle, M. Springer, A. Kemper
DOI: 10.1145/3533028.3533302 (https://doi.org/10.1145/3533028.3533302) | Published: 2022-06-12
Abstract: Both forward- and reverse-mode automatic differentiation automatically derive the model function used for gradient descent. Reverse mode calculates all derivatives in one run, whereas forward mode requires rerunning the algorithm for every variable whose derivative is needed. To allow for in-database machine learning, we have integrated automatic differentiation as an SQL operator inside the Umbra database system. To benchmark code generation for GPU, we implement both forward- and reverse-mode automatic differentiation. Inspection of the optimised LLVM code shows that nearly the same machine code is executed after the generated LLVM code has been optimised. Thus, both modes yield similar runtimes but different compilation times.
Citations: 1
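The forward/reverse distinction in this abstract can be sketched outside the database context. The following is a minimal illustration, not the Umbra SQL operator: the example function f(x, y) = x*y + sin(x) and the hand-unrolled backward sweep are our own assumptions. Forward mode needs one pass per input variable (seeding that variable's derivative with 1), while reverse mode yields all partials in a single backward sweep.

```python
import math

# Forward mode: propagate (value, derivative) pairs ("dual numbers").
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def dual_sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

def f_forward(x, y):             # f(x, y) = x*y + sin(x), on dual numbers
    return x * y + dual_sin(x)

# One forward pass per input variable, seeding its derivative with 1.
df_dx = f_forward(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot
df_dy = f_forward(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot

# Reverse mode: one forward pass, then one backward sweep (hand-unrolled
# here for this tiny function) accumulates all partials at once.
def f_reverse(x, y):
    out = x * y + math.sin(x)
    grad_x = y + math.cos(x)     # adjoint of x: d(out)/dx
    grad_y = x                   # adjoint of y: d(out)/dy
    return out, grad_x, grad_y

out, gx, gy = f_reverse(2.0, 3.0)
assert abs(gx - df_dx) < 1e-12 and abs(gy - df_dy) < 1e-12
```

Both modes compute the same derivatives, which matches the paper's observation that an optimising compiler can reduce them to nearly the same machine code; the cost difference lies in how many passes (forward) versus how much tape bookkeeping (reverse) each mode needs.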
Towards data-centric what-if analysis for native machine learning pipelines
Stefan Grafberger, Paul Groth, Sebastian Schelter
DOI: 10.1145/3533028.3533303 (https://doi.org/10.1145/3533028.3533303) | Published: 2022-06-12
Abstract: An important task of data scientists is to understand the sensitivity of their models to changes in the data that the models are trained and tested upon. Currently, conducting such data-centric what-if analyses requires significant and costly manual development and testing with the corresponding chance for the introduction of bugs. We discuss the problem of data-centric what-if analysis over whole ML pipelines (including data preparation and feature encoding), propose optimisations that reuse trained models and intermediate data to reduce the runtime of such analysis, and finally conduct preliminary experiments on three complex example pipelines, where our approach reduces the runtime by a factor of up to six.
Citations: 4
Learning-to-learn efficiently with self-learning
Shruti Kunde, Sharod Roy Choudhury, Amey Pandit, Rekha Singhal
DOI: 10.1145/3533028.3533307 (https://doi.org/10.1145/3533028.3533307) | Published: 2022-06-12
Abstract: Digital Twins of industrial process plants enable various what-if and if-what scenarios of the plants' functioning for fault diagnosis and general monitoring in the real world. They do so through machine learning (ML) models built using data from sensors fitted in the plant. Over time, environmental factors cause variations in sensor readings, adversely affecting the quality of the models' predictions. This triggers the self-learning loop, leading to the re-tuning/re-training of models. Reducing the time spent in self-learning is challenging, since multiple models must be trained repeatedly using multiple algorithms, which translates into long training times. We propose a metalearner that recommends the optimal regression algorithm for a model, eliminating the need to train the model with multiple algorithms for every self-learning instance. The metalearner is trained on metafeatures extracted from the data, which makes it application-agnostic. We introduce domain metafeatures, which enhance metalearner prediction accuracy, and propose machine learning and deep learning based approaches for selecting optimal metafeatures. To ensure the relevance of the selected metafeatures, we introduce novel static and dynamic reward functions for dynamic metafeature selection using a Q-Learning based approach. Our metalearning approach speeds up the selection of the optimal regressor among 5 candidate regressors by 5X to 27X over traditional self-learning approaches. The incremental pre-processing approach achieves a speed-up of 25X over the traditional approach. The proposed metalearner achieves an AUC of 0.989, 0.954 and 0.998 for the ML, DL and RL based metafeature selection approaches respectively. We illustrate our findings on 3 datasets from the industrial process domain.
Citations: 1
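The reward-driven metafeature selection mentioned in this abstract can be sketched in simplified form. This is a hypothetical one-step (bandit-style) simplification of Q-learning, not the paper's method: the metafeature names, the reward function, and all hyperparameters below are illustrative assumptions. The idea is that the value of adding a metafeature is learned from the accuracy gain (the reward) it produces.

```python
import random

def q_select(metafeatures, reward_fn, episodes=200, epsilon=0.2, alpha=0.5, seed=0):
    """Epsilon-greedy, single-step tabular value learning over metafeatures."""
    rng = random.Random(seed)
    q = {f: 0.0 for f in metafeatures}           # one Q-value per metafeature
    for _ in range(episodes):
        # Explore a random metafeature with probability epsilon, else exploit.
        f = (rng.choice(metafeatures) if rng.random() < epsilon
             else max(q, key=q.get))
        q[f] += alpha * (reward_fn(f) - q[f])    # tabular update toward reward
    return max(q, key=q.get)

# Toy reward: pretend "skewness" yields the largest validation-accuracy gain.
gains = {"n_rows": 0.01, "skewness": 0.05, "kurtosis": 0.02}
best = q_select(list(gains), lambda f: gains[f])
```

The paper's static and dynamic reward functions, and its sequential state space (sets of already-selected metafeatures), are richer than this single-arm sketch; this only shows the mechanism by which a reward signal steers selection.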
Accelerating container-based deep learning hyperparameter optimization workloads
Rui Liu, David Wong, David J. Lange, Patrik Larsson, Vinay Jethava, Qing Zheng
DOI: 10.1145/3533028.3533309 (https://doi.org/10.1145/3533028.3533309) | Published: 2022-06-12
Abstract: DocuSign is advancing artificial intelligence at a great pace and embracing a continuous shift towards developing and deploying an increasing number of deep learning models. During the development stage, developers usually build a number of deep learning models and train them under many potential hyperparameter configurations to find the best-performing one, a process called hyperparameter optimization (HPO). Such HPO jobs can run for a long time due to ever-larger models and numerous hyperparameter configurations. Furthermore, the HPO jobs at DocuSign are processed in container-based environments so that the best-performing model can be deployed and maintained in production reliably and efficiently. The workload consists of long-running, containerized HPO jobs that can rapidly saturate the current machine learning infrastructure at DocuSign, yet key resources (e.g., GPU memory or compute units) are not always fully utilized; for example, some hyperparameter configurations may take only a fraction of the GPU memory but still occupy the entire device due to containerization. Because of this issue, users may have to wait or manually coordinate with others for resources to run their jobs, and such HPO workloads often take an unexpectedly long time to complete. To address this problem, we propose Relish, a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments, so that multiple containerized segmented jobs can be executed in parallel. We evaluate Relish on an HPO workload based on a three-month trace from a multi-tenant GPU cluster of a research and development team at DocuSign; the results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient execution of multiple jobs.
Citations: 2
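The utilization problem motivating this paper can be illustrated with a toy first-fit packing of trials onto shared GPUs. This is not Relish itself (which segments jobs and schedules containers); the memory fractions and the first-fit policy below are our own illustrative assumptions.

```python
def pack_trials(mem_fractions, gpu_capacity=1.0):
    """First-fit: place each trial on the first GPU with enough free memory."""
    gpus = []                         # remaining capacity per GPU
    placement = []                    # GPU index assigned to each trial
    for frac in mem_fractions:
        for i, free in enumerate(gpus):
            if frac <= free + 1e-9:
                gpus[i] -= frac
                placement.append(i)
                break
        else:                         # no existing GPU fits: allocate a new one
            gpus.append(gpu_capacity - frac)
            placement.append(len(gpus) - 1)
    return placement, len(gpus)

# Six trials that would occupy six whole devices under one-container-per-GPU
# fit on two GPUs when their memory fractions are respected.
placement, n_gpus = pack_trials([0.5, 0.25, 0.25, 0.4, 0.3, 0.3])
```

This only captures the "fraction of the GPU occupies the entire device" waste; the actual system must also handle isolation, interference, and dynamic job arrival.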
Evaluating model serving strategies over streaming data
Sonia-Florina Horchidan, Emmanouil Kritharakis, Vasiliki Kalavri, Paris Carbone
DOI: 10.1145/3533028.3533308 (https://doi.org/10.1145/3533028.3533308) | Published: 2022-06-12
Abstract: We present the first performance evaluation study of model serving integration tools in stream processing frameworks. Using Apache Flink as a representative stream processing system, we evaluate alternative Deep Learning serving pipelines for image classification. Our performance evaluation considers both the case of embedded use of Machine Learning libraries within stream tasks and that of external serving via Remote Procedure Calls. The results indicate superior throughput and scalability for pipelines that make use of embedded libraries to serve pre-trained models. Latency, however, can vary across strategies: external serving can even achieve lower latency when network conditions are optimal, due to better specialised use of the underlying hardware. We discuss our findings and provide further motivating arguments towards future research in the area of ML-native data streaming engines.
Citations: 3
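The two strategies compared in this abstract can be contrasted structurally in a framework-agnostic sketch (this is not the paper's Flink code; the stand-in model, event schema, and in-process "RPC server" are illustrative assumptions). Embedded serving calls the model in-process per event; external serving serializes each event and crosses a process boundary, simulated here by a local call on a JSON payload.

```python
import json

def model(features):               # stand-in for a pre-trained classifier
    return 1 if sum(features) > 0 else 0

# Strategy 1: embedded -- the model lives inside the stream operator.
def embedded_operator(event):
    return model(event["features"])

# Strategy 2: external -- each event is (de)serialized for an RPC; the
# network hop is simulated by a local call on the serialized payload.
def rpc_server(payload):
    request = json.loads(payload)
    return json.dumps({"label": model(request["features"])})

def external_operator(event):
    response = rpc_server(json.dumps({"features": event["features"]}))
    return json.loads(response)["label"]

stream = [{"features": [0.5, -0.1]}, {"features": [-2.0, 0.3]}]
labels = [embedded_operator(e) for e in stream]
assert labels == [external_operator(e) for e in stream]
```

The per-event serialization and boundary crossing visible in the external path is the structural source of the throughput gap the paper measures, while dedicated serving hardware can offset it on the latency side.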
How I stopped worrying about training data bugs and started complaining
Lampros Flokas, Weiyuan Wu, Jiannan Wang, Nakul Verma, Eugene Wu
DOI: 10.1145/3533028.3533305 (https://doi.org/10.1145/3533028.3533305) | Published: 2022-06-12
Abstract: There is an increasing awareness of the gap between machine learning research and production. The research community has largely focused on developing a model that performs well on a validation set, but the production environment needs to make sure the model also performs well in a downstream application. The latter is more challenging because the test/inference-time data used in the application could be quite different from the training data. To address this challenge, we advocate for "complaint-driven" data debugging, which allows the user to complain about the unexpected behaviors of the model in the downstream application, and which proposes interventions for the training data errors that likely led to the complaints. This new debugging paradigm helps solve a range of training data quality problems such as labeling error, fairness, and data drift. We present our long-term vision, highlight achieved milestones, and outline a research roadmap including a number of open problems.
Citations: 0
GouDa - generation of universal data sets: improving analysis and evaluation of data preparation pipelines
Valerie Restat, Gerrit Boerner, Andrew P. Conrad, U. Störl
DOI: 10.1145/3533028.3533311 (https://doi.org/10.1145/3533028.3533311) | Published: 2022-06-12
Abstract: Data preparation is necessary to ensure data quality in machine learning-based decisions and data-driven systems. A variety of different tools exist to simplify this process. However, there is often a lack of suitable data sets to evaluate and compare existing tools and new research approaches. For this reason, we implemented GouDa, a tool for generating universal data sets. GouDa can be used to create data sets with arbitrary error types at arbitrary error rates. In addition to the data sets with automatically generated errors, ground truth is provided. Thus, GouDa can be used for the extensive analysis and evaluation of data preparation pipelines.
Citations: 1
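The core idea in this abstract, injecting a chosen error type at a chosen rate while keeping the clean copy as ground truth, can be sketched for a single error type. This is our own minimal illustration, not GouDa's actual API: the function name, table layout, and missing-values error type are assumptions.

```python
import copy
import random

def inject_missing_values(rows, column, error_rate, seed=0):
    """Set `column` to None in roughly `error_rate` of rows.

    Returns (dirty, truth): the corrupted copy and the untouched ground
    truth, so a data preparation pipeline's repairs can be scored exactly.
    """
    rng = random.Random(seed)                # fixed seed: reproducible errors
    truth = copy.deepcopy(rows)
    dirty = copy.deepcopy(rows)
    for row in dirty:
        if rng.random() < error_rate:
            row[column] = None
    return dirty, truth

clean = [{"id": i, "price": float(i)} for i in range(1000)]
dirty, truth = inject_missing_values(clean, "price", error_rate=0.2)
missing = sum(1 for r in dirty if r["price"] is None)
```

GouDa generalises this across many error types (and their combinations) at arbitrary rates; having `truth` alongside `dirty` is what makes objective pipeline evaluation possible.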
Minun
Jin Wang, Yuliang Li
DOI: 10.1145/3533028.3533304 (https://doi.org/10.1145/3533028.3533304) | Published: 2022-06-12
Abstract: Entity Matching (EM) is an important problem in data integration and cleaning. More recently, deep learning techniques, especially pre-trained language models, have been integrated into EM applications and achieved promising results. Unfortunately, the significant performance gain comes with the loss of explainability and transparency, deterring EM from the requirement of responsible data management. To address this issue, recent studies extended explainable AI techniques to explain black-box EM models. However, these solutions have major drawbacks: (i) their explanations do not capture the unique semantic characteristics of the EM problem; and (ii) they fail to provide an objective method to quantitatively evaluate the provided explanations. In this paper, we propose Minun, a model-agnostic method to generate explanations for EM solutions. We utilize counterfactual examples generated from an EM-customized search space as the explanations and develop two search algorithms to efficiently find such results. We also propose a novel evaluation framework based on a student-teacher paradigm. The framework enables the evaluation of explanations of diverse formats by capturing the performance gain of a "student" model at simulating the target "teacher" model when explanations are given as side input. We conduct an extensive set of experiments on explaining state-of-the-art deep EM models on popular EM benchmark datasets. The results demonstrate that Minun significantly outperforms popular explainable AI methods such as LIME and SHAP on both explanation quality and scalability.
Citations: 2
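The counterfactual idea in this abstract can be sketched with a toy matcher. This is in the spirit of Minun but is not its algorithm: the Jaccard matcher, the naive last-token-first removal order, and all names below are illustrative assumptions. The tokens whose removal flips the matcher's decision serve as the explanation.

```python
def jaccard_match(left, right, threshold=0.5):
    """Toy entity matcher: token-set Jaccard similarity against a threshold."""
    a, b = set(left.split()), set(right.split())
    return len(a & b) / len(a | b) >= threshold

def counterfactual_tokens(left, right, matcher=jaccard_match):
    """Drop tokens from `left` until the match decision flips.

    Returns the dropped tokens (the counterfactual explanation), or None
    if no token subset flips the decision. Minun instead searches an
    EM-customized edit space with efficient algorithms; this greedy
    last-first order is deliberately naive.
    """
    original = matcher(left, right)
    tokens = left.split()
    removed = []
    while tokens and matcher(" ".join(tokens), right) == original:
        removed.append(tokens.pop())
    flipped = matcher(" ".join(tokens), right) != original
    return removed if flipped else None

explanation = counterfactual_tokens("apple iphone 12 pro", "apple iphone 12")
```

A minimal such edit set tells the user which tokens the matcher's decision actually hinges on, which is the EM-specific semantics that generic attribution methods like LIME and SHAP miss.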