Towards optimising distributed data streaming graphs using parallel streams

IEEE International Symposium on High-Performance Parallel Distributed Computing
Pub Date: 2010-06-21
DOI: 10.1145/1851476.1851583

C. Liew, M. Atkinson, Jano van Hemert, Liangxiu Han


Citations: 57

Abstract

Modern scientific collaborations have opened up the opportunity to solve complex problems that involve multi-disciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data located in distributed data repositories that run various software systems and are managed by different organisations. A common strategy for making the experiments more manageable is to execute the processing steps as a workflow. In this paper, we look into implementing the fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph in which the nodes represent processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies on a real-world problem in the Life Sciences, EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies, and the evaluation experiments. We demonstrate linear speed-up and argue that this use of data streaming to enable both overlapped pipelined and parallelised enactment is a generally applicable optimisation strategy.
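The model described in the abstract can be illustrated with a small sketch. This is not the authors' system: it is a minimal Python illustration, assuming that each processing element can be written as a generator that consumes and emits one item at a time, and that a stage produces exactly one output per input. All names (`source`, `scale`, `offset`, `parallel_streams`) are hypothetical. Threads stand in for the separate processes mentioned in the abstract, so the sketch shows the structure of the optimisation rather than a real speed-up.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator

_END = object()  # sentinel marking the end of a stream


def source(n: int) -> Iterator[int]:
    """Hypothetical producer element: emits items incrementally."""
    for i in range(n):
        yield i


def scale(stream: Iterable[int]) -> Iterator[int]:
    """Hypothetical processing element: handles each item as it arrives."""
    for item in stream:
        yield item * 2


def offset(stream: Iterable[int]) -> Iterator[int]:
    """Hypothetical processing element placed downstream of `scale`."""
    for item in stream:
        yield item + 1


def parallel_streams(
    items: Iterable[int],
    stage: Callable[[Iterable[int]], Iterator[int]],
    n_streams: int,
) -> Iterator[int]:
    """Run `n_streams` copies of `stage`, each fed by its own stream.

    Items are dealt round-robin to the copies and read back in the same
    round-robin order, which preserves the input order as long as `stage`
    emits exactly one output per input (an assumption of this sketch).
    """
    inboxes = [queue.Queue() for _ in range(n_streams)]
    outboxes = [queue.Queue() for _ in range(n_streams)]

    def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
        # iter(callable, sentinel) keeps calling inbox.get() until _END.
        for result in stage(iter(inbox.get, _END)):
            outbox.put(result)
        outbox.put(_END)

    def feed() -> None:
        for k, item in enumerate(items):
            inboxes[k % n_streams].put(item)
        for inbox in inboxes:
            inbox.put(_END)

    threads = [
        threading.Thread(target=worker, args=(i, o))
        for i, o in zip(inboxes, outboxes)
    ]
    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    k = 0
    while True:
        result = outboxes[k % n_streams].get()
        if result is _END:  # the stream whose turn it is has run dry
            break
        yield result
        k += 1

    for t in threads:
        t.join()


if __name__ == "__main__":
    # Pipeline: source -> scale -> offset, with `offset` split over 3 streams.
    print(list(parallel_streams(scale(source(10)), offset, 3)))
```

Chaining the generators gives the pipelined, item-at-a-time hand-off between elements; `parallel_streams` adds the extra parallel streams for one stage. In CPython the global interpreter lock prevents CPU-bound threads from running simultaneously, so a measurable speed-up would require separate processes or machines.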
