计算科学与工程中机器学习生命周期中的来源数据

2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS) Pub Date : 2019-10-09 DOI:10.1109/WORKS49585.2019.00006

Renan Souza, L. Azevedo, Vítor Lourenço, E. Soares, R. Thiago, R. Brandão, D. Civitarese, E. V. Brazil, M. Moreno, P. Valduriez, M. Mattoso, Renato Cerqueira, M. Netto

{"title":"计算科学与工程中机器学习生命周期中的来源数据","authors":"Renan Souza, L. Azevedo, Vítor Lourenço, E. Soares, R. Thiago, R. Brandão, D. Civitarese, E. V. Brazil, M. Moreno, P. Valduriez, M. Mattoso, Renato Cerqueira, M. Netto","doi":"10.1109/WORKS49585.2019.00006","DOIUrl":null,"url":null,"abstract":"Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate a ML model from scratch or to explain to stackholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle, while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute with a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the O&G industry, along with its evaluation using 239,616 CUDA cores in parallel.","PeriodicalId":436926,"journal":{"name":"2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":"{\"title\":\"Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering\",\"authors\":\"Renan Souza, L. Azevedo, Vítor Lourenço, E. Soares, R. Thiago, R. Brandão, D. Civitarese, E. V. Brazil, M. Moreno, P. Valduriez, M. Mattoso, Renato Cerqueira, M. Netto\",\"doi\":\"10.1109/WORKS49585.2019.00006\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate a ML model from scratch or to explain to stackholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle, while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute with a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the O&G industry, along with its evaluation using 239,616 CUDA cores in parallel.\",\"PeriodicalId\":436926,\"journal\":{\"name\":\"2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS)\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"25\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/WORKS49585.2019.00006\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WORKS49585.2019.00006","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 25

摘要

机器学习(ML)在许多行业已经变得必不可少。在计算科学与工程(CSE)中，机器学习生命周期的复杂性来自于大量的数据、科学家的专业知识、工具和工作流程。如果在生命周期中没有正确地跟踪数据，那么从头开始重新创建ML模型或向涉众解释它是如何创建的就变得不可行的。来源跟踪解决方案的主要限制是它们不能处理在生命周期中多个工作流中处理的来源捕获和领域和ML数据的集成，同时保持较低的来源捕获开销。为了解决这一问题，本文对CSE中ML生命周期中的来源数据进行了详细的描述;一种新的来源数据表示，称为provi -ML，建立在W3C provv和ML模式之上;以及对系统的扩展，该系统跟踪来自多个工作流的来源，以解决ML和CSE的特征，并允许使用标准词汇表进行来源查询。我们在油气行业的一个实际案例中展示了它的实际应用，以及并行使用239,616个CUDA内核的评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate a ML model from scratch or to explain to stackholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle, while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute with a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the O&G industry, along with its evaluation using 239,616 CUDA cores in parallel.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS)

自引率

0.00%

发文量