在数据密集型科学工作流中实现持久队列的方法

2011 IEEE World Congress on Services Pub Date : 2011-07-04 DOI:10.1109/SERVICES.2011.57

Michael Agun, S. Bowers

{"title":"在数据密集型科学工作流中实现持久队列的方法","authors":"Michael Agun, S. Bowers","doi":"10.1109/SERVICES.2011.57","DOIUrl":null,"url":null,"abstract":"Many scientific workflow systems are built on dataflow-based models of computation in which data drives the execution of workflow components. An advantage of using dataflow models is their straightforward semantics (which includes support for branching, merging, and looping) and their ability to concurrently execute workflow steps. However, for many data-intensive workflows the dataflow model often requires data buffering. Current systems largely perform buffering through in-memory queues which can lead to buffer overflow and performance degradation as queues reach capacity (e.g., because of paging). We describe an alternative framework that leverages external storage to implement buffers (which we refer to as persistent queues) within data-intensive scientific workflows. Our framework can easily be used with different underlying storage technologies, and we consider and evaluate three distinct approaches: a traditional relational database implementation, a non-relational implementation designed for fast reads and writes, and a specialized approach that can further reduce external buffering overhead. In addition, the use of persistent queues can provide detailed provenance information ``for free'' by capturing the input and output information of each workflow component during workflow execution. Although many systems provide such provenance information, we show how this information can be captured both efficiently and can be used to improve overall workflow performance through persistent queues.","PeriodicalId":429726,"journal":{"name":"2011 IEEE World Congress on Services","volume":"62 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Approaches for Implementing Persistent Queues within Data-Intensive Scientific Workflows\",\"authors\":\"Michael Agun, S. Bowers\",\"doi\":\"10.1109/SERVICES.2011.57\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many scientific workflow systems are built on dataflow-based models of computation in which data drives the execution of workflow components. An advantage of using dataflow models is their straightforward semantics (which includes support for branching, merging, and looping) and their ability to concurrently execute workflow steps. However, for many data-intensive workflows the dataflow model often requires data buffering. Current systems largely perform buffering through in-memory queues which can lead to buffer overflow and performance degradation as queues reach capacity (e.g., because of paging). We describe an alternative framework that leverages external storage to implement buffers (which we refer to as persistent queues) within data-intensive scientific workflows. Our framework can easily be used with different underlying storage technologies, and we consider and evaluate three distinct approaches: a traditional relational database implementation, a non-relational implementation designed for fast reads and writes, and a specialized approach that can further reduce external buffering overhead. In addition, the use of persistent queues can provide detailed provenance information ``for free'' by capturing the input and output information of each workflow component during workflow execution. Although many systems provide such provenance information, we show how this information can be captured both efficiently and can be used to improve overall workflow performance through persistent queues.\",\"PeriodicalId\":429726,\"journal\":{\"name\":\"2011 IEEE World Congress on Services\",\"volume\":\"62 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-07-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE World Congress on Services\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SERVICES.2011.57\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE World Congress on Services","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SERVICES.2011.57","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

许多科学工作流系统建立在基于数据流的计算模型之上，其中数据驱动工作流组件的执行。使用数据流模型的一个优点是其简单的语义(包括对分支、合并和循环的支持)以及并发执行工作流步骤的能力。然而，对于许多数据密集型工作流，数据流模型通常需要数据缓冲。当前系统主要通过内存队列执行缓冲，当队列达到容量时(例如，由于分页)，这可能导致缓冲区溢出和性能下降。我们描述了一个替代框架，它利用外部存储来实现数据密集型科学工作流中的缓冲区(我们称之为持久队列)。我们的框架可以很容易地与不同的底层存储技术一起使用，我们考虑并评估了三种不同的方法:传统的关系数据库实现、专为快速读写而设计的非关系数据库实现和可以进一步减少外部缓冲开销的专用方法。此外，通过在工作流执行期间捕获每个工作流组件的输入和输出信息，使用持久队列可以“免费”提供详细的来源信息。尽管许多系统都提供了这样的来源信息，但我们将展示如何有效地捕获这些信息，并通过持久队列提高总体工作流性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Approaches for Implementing Persistent Queues within Data-Intensive Scientific Workflows

Many scientific workflow systems are built on dataflow-based models of computation in which data drives the execution of workflow components. An advantage of using dataflow models is their straightforward semantics (which includes support for branching, merging, and looping) and their ability to concurrently execute workflow steps. However, for many data-intensive workflows the dataflow model often requires data buffering. Current systems largely perform buffering through in-memory queues which can lead to buffer overflow and performance degradation as queues reach capacity (e.g., because of paging). We describe an alternative framework that leverages external storage to implement buffers (which we refer to as persistent queues) within data-intensive scientific workflows. Our framework can easily be used with different underlying storage technologies, and we consider and evaluate three distinct approaches: a traditional relational database implementation, a non-relational implementation designed for fast reads and writes, and a specialized approach that can further reduce external buffering overhead. In addition, the use of persistent queues can provide detailed provenance information ``for free'' by capturing the input and output information of each workflow component during workflow execution. Although many systems provide such provenance information, we show how this information can be captured both efficiently and can be used to improve overall workflow performance through persistent queues.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2011 IEEE World Congress on Services

自引率

0.00%

发文量