A Practical Roadmap for Provenance Capture and Data Analysis in Spark-Based Scientific Workflows

2018 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS). Pub Date: 2018-11-01. DOI: 10.1109/WORKS.2018.00009

Thaylon Guedes, V. Silva, M. Mattoso, Marcos V. N. Bedo, Daniel de Oliveira


Citations: 13

Abstract


Whenever high-performance computing applications meet data-intensive scalable systems, an attractive approach is the use of Apache Spark for the management of scientific workflows. Spark provides several advantages such as being widely supported and granting efficient in-memory data management for large-scale applications. However, Spark still lacks support for data tracking and workflow provenance. Additionally, Spark's memory management requires accessing all data movements between the workflow activities. Therefore, the running of legacy programs on Spark is interpreted as a "black-box" activity, which prevents the capture and analysis of implicit data movements. Here, we present SAMbA, an Apache Spark extension for the gathering of prospective and retrospective provenance and domain data within distributed scientific workflows. Our approach relies on enveloping both RDD structure and data contents at runtime so that (i) RDD-enclosure consumed and produced data are captured and registered by SAMbA in a structured way, and (ii) provenance data can be queried during and after the execution of scientific workflows. By following the W3C PROV representation, we model the roles of RDD regarding prospective and retrospective provenance data. Our solution provides mechanisms for the capture and storage of provenance data without jeopardizing Spark's performance. The provenance retrieval capabilities of our proposal are evaluated in a practical case study, in which data analytics are provided by several SAMbA parameterizations.
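The idea of enveloping a dataset so that consumed and produced data are registered as a side effect of each transformation can be illustrated with a minimal sketch. This is not SAMbA's API and does not use Spark: `ProvStore`, `TrackedDataset`, and `lineage` are hypothetical names, the store is an in-memory stand-in for a real provenance database, and the transformation runs eagerly rather than lazily as Spark's RDDs do. The recorded links loosely follow the W3C PROV notions of activities, entities, and derivations.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ProvStore:
    """Hypothetical in-memory provenance store (illustration only)."""
    # Prospective provenance: the names of the workflow activities.
    activities: List[str] = field(default_factory=list)
    # Retrospective provenance: data elements (PROV entities) by id.
    entities: Dict[int, Any] = field(default_factory=dict)
    # Retrospective provenance: (activity, consumed id, produced id).
    derivations: List[Tuple[str, int, int]] = field(default_factory=list)

    def register(self, value: Any) -> int:
        """Store a data element and return its new entity id."""
        entity_id = len(self.entities) + 1
        self.entities[entity_id] = value
        return entity_id


class TrackedDataset:
    """Envelope around a list of values; every element carries an entity id."""

    def __init__(self, store: ProvStore, items: List[Tuple[int, Any]]):
        self.store = store
        self.items = items

    @classmethod
    def from_values(cls, store: ProvStore, values: List[Any]) -> "TrackedDataset":
        return cls(store, [(store.register(v), v) for v in values])

    def map(self, activity: str, fn: Callable[[Any], Any]) -> "TrackedDataset":
        """Apply fn to each element and record which input produced which output."""
        self.store.activities.append(activity)
        out = []
        for in_id, value in self.items:
            result = fn(value)
            out_id = self.store.register(result)
            self.store.derivations.append((activity, in_id, out_id))
            out.append((out_id, result))
        return TrackedDataset(self.store, out)

    def values(self) -> List[Any]:
        return [v for _, v in self.items]


def lineage(store: ProvStore, entity_id: int) -> List[Tuple[str, Any, Any]]:
    """Walk derivations backwards: (activity, consumed value, produced value)."""
    steps = []
    current = entity_id
    while True:
        match = next((d for d in store.derivations if d[2] == current), None)
        if match is None:
            return steps
        activity, in_id, out_id = match
        steps.append((activity, store.entities[in_id], store.entities[out_id]))
        current = in_id


store = ProvStore()
raw = TrackedDataset.from_values(store, [1, 2, 3])
result = raw.map("double", lambda x: x * 2).map("increment", lambda x: x + 1)
print(result.values())                       # [3, 5, 7]
print(lineage(store, result.items[0][0]))    # [('increment', 2, 3), ('double', 1, 2)]
```

Because the store is filled while the transformations run, it can be queried both during and after execution, which mirrors point (ii) of the abstract in a very simplified form.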
