A Data Placement Strategy for Data-Intensive Scientific Workflows in Cloud
Qing Zhao, Congcong Xiong, Xi Zhao, Ce Yu, Jian Xiao
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 928-934, published 2015-05-04
DOI: 10.1109/CCGrid.2015.72
Citations: 27
Abstract
With the rise of cloud computing and Big Data, many scientific applications that process large amounts of data can be abstracted as scientific workflows and run in a cloud environment. Placing the datasets intelligently can substantially reduce data transfers during workflow execution. In this paper, we propose a two-stage data placement strategy. In the initial stage, we cluster the datasets based on their correlation and allocate these clusters to data centers. Compared with existing work, we incorporate the data size into the correlation calculation and propose a new type of correlation for intermediate data, named the "first-order conduction correlation", so that the data transmission cost can be estimated more accurately. In the runtime stage, a re-distribution algorithm adjusts the data layout in response to changing factors, and the overhead of the re-layout itself is also taken into account. Simulation results show that, compared with previous work, the proposed strategy effectively reduces the time spent on data movements during workflow execution.
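To make the initial-stage idea concrete, the following is a minimal sketch of correlation-based clustering and placement. It is not the paper's algorithm: the abstract does not give the correlation formula, the clustering procedure, or the placement rule, so the size-weighted correlation, the greedy merging, and the first-fit-decreasing assignment below (and all names such as cluster_datasets, place_clusters, shared_tasks) are illustrative assumptions only.

```python
# Hypothetical sketch of correlation-aware data placement (assumed details,
# not the algorithm from the paper).
from itertools import combinations


def correlation(d1, d2, shared_tasks, size):
    """Assumed size-aware correlation: number of workflow tasks that use both
    datasets, weighted by the smaller dataset's size."""
    return shared_tasks.get(frozenset((d1, d2)), 0) * min(size[d1], size[d2])


def cluster_datasets(datasets, size, shared_tasks, k):
    """Greedily merge the two most correlated clusters until k clusters remain."""
    clusters = [{d} for d in datasets]
    while len(clusters) > k:
        best, pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            score = sum(correlation(a, b, shared_tasks, size)
                        for a in clusters[i] for b in clusters[j])
            if score > best:
                best, pair = score, (i, j)
        i, j = pair
        clusters[i] |= clusters.pop(j)  # j > i, so index i stays valid
    return clusters


def place_clusters(clusters, size, capacity):
    """Assign the largest clusters first to the data center with the most
    remaining capacity (first-fit-decreasing style heuristic)."""
    remaining = dict(capacity)
    placement = {}
    for cluster in sorted(clusters, key=lambda c: -sum(size[d] for d in c)):
        center = max(remaining, key=remaining.get)
        need = sum(size[d] for d in cluster)
        if need > remaining[center]:
            raise ValueError("cluster does not fit in any single data center")
        remaining[center] -= need
        placement.update({d: center for d in cluster})
    return placement


if __name__ == "__main__":
    size = {"d1": 4, "d2": 2, "d3": 6, "d4": 1}
    # tasks that read each unordered pair of datasets together (toy input)
    shared_tasks = {frozenset(("d1", "d2")): 3, frozenset(("d3", "d4")): 2}
    clusters = cluster_datasets(size.keys(), size, shared_tasks, k=2)
    print(place_clusters(clusters, size, {"dc1": 10, "dc2": 10}))
```

Under these assumptions, strongly correlated datasets (those consumed together by many tasks, with size taken into account) end up in the same data center, which is the mechanism the abstract credits with reducing runtime data movement; the runtime re-distribution stage would then revisit this layout as correlations change during execution.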