优化洗牌广域数据分析

Shuhao Liu, Hao Wang, Baochun Li
{"title":"优化洗牌广域数据分析","authors":"Shuhao Liu, Hao Wang, Baochun Li","doi":"10.1109/ICDCS.2017.131","DOIUrl":null,"url":null,"abstract":"As increasingly large volumes of raw data are generated at geographically distributed datacenters, they need to be efficiently processed by data analytic jobs spanning multiple datacenters across wide-area networks. Designed for a single datacenter, existing data processing frameworks, such as Apache Spark, are not able to deliver satisfactory performance when these wide-area analytic jobs are executed. As wide-area networks interconnecting datacenters may not be congestion free, there is a compelling need for a new system framework that is optimized for wide-area data analytics. In this paper, we design and implement a new proactive data aggregation framework based on Apache Spark, with a focus on optimizing the network traffic incurred in shuffle stages of data analytic jobs. The objective of this framework is to strategically and proactively aggregate the output data of mapper tasks to a subset of worker datacenters, as a replacement to Spark's original passive fetch mechanism across datacenters. It improves the performance of wide-area analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across six Amazon EC2 regions have shown that our proposed framework is able to reduce job completion times by up to 73%, as compared to the existing baseline implementation in Spark.","PeriodicalId":127689,"journal":{"name":"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":"{\"title\":\"Optimizing Shuffle in Wide-Area Data Analytics\",\"authors\":\"Shuhao Liu, Hao Wang, Baochun Li\",\"doi\":\"10.1109/ICDCS.2017.131\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As increasingly large volumes of raw data are generated at geographically distributed datacenters, they need to be efficiently processed by data analytic jobs spanning multiple datacenters across wide-area networks. Designed for a single datacenter, existing data processing frameworks, such as Apache Spark, are not able to deliver satisfactory performance when these wide-area analytic jobs are executed. As wide-area networks interconnecting datacenters may not be congestion free, there is a compelling need for a new system framework that is optimized for wide-area data analytics. In this paper, we design and implement a new proactive data aggregation framework based on Apache Spark, with a focus on optimizing the network traffic incurred in shuffle stages of data analytic jobs. The objective of this framework is to strategically and proactively aggregate the output data of mapper tasks to a subset of worker datacenters, as a replacement to Spark's original passive fetch mechanism across datacenters. It improves the performance of wide-area analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across six Amazon EC2 regions have shown that our proposed framework is able to reduce job completion times by up to 73%, as compared to the existing baseline implementation in Spark.\",\"PeriodicalId\":127689,\"journal\":{\"name\":\"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)\",\"volume\":\"57 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"17\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDCS.2017.131\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS.2017.131","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 17

摘要

随着地理分布的数据中心产生越来越多的大量原始数据,需要跨广域网的多个数据中心的数据分析作业有效地处理这些数据。现有的数据处理框架(如Apache Spark)是为单个数据中心设计的,当执行这些广域分析作业时,它们无法提供令人满意的性能。由于连接数据中心的广域网可能不会没有拥塞,因此迫切需要针对广域数据分析进行优化的新系统框架。本文设计并实现了一个新的基于Apache Spark的主动数据聚合框架,重点对数据分析作业shuffle阶段产生的网络流量进行优化。该框架的目标是战略性地、主动地将mapper任务的输出数据聚合到工作数据中心的一个子集,作为Spark原始的跨数据中心被动获取机制的替代。它通过避免重复的数据传输提高了广域分析工作的性能,从而提高了数据中心间链路的利用率。我们在六个Amazon EC2区域使用标准基准进行了广泛的实验,结果表明,与Spark中现有的基线实现相比,我们提出的框架能够将作业完成时间减少73%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Optimizing Shuffle in Wide-Area Data Analytics
As increasingly large volumes of raw data are generated at geographically distributed datacenters, they need to be efficiently processed by data analytic jobs spanning multiple datacenters across wide-area networks. Designed for a single datacenter, existing data processing frameworks, such as Apache Spark, are not able to deliver satisfactory performance when these wide-area analytic jobs are executed. As wide-area networks interconnecting datacenters may not be congestion free, there is a compelling need for a new system framework that is optimized for wide-area data analytics. In this paper, we design and implement a new proactive data aggregation framework based on Apache Spark, with a focus on optimizing the network traffic incurred in shuffle stages of data analytic jobs. The objective of this framework is to strategically and proactively aggregate the output data of mapper tasks to a subset of worker datacenters, as a replacement to Spark's original passive fetch mechanism across datacenters. It improves the performance of wide-area analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across six Amazon EC2 regions have shown that our proposed framework is able to reduce job completion times by up to 73%, as compared to the existing baseline implementation in Spark.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信