TR-Spark: Transient Computing for Big Data Analytics

Proceedings of the Seventh ACM Symposium on Cloud Computing. Pub Date: 2016-10-05. DOI: 10.1145/2987550.2987576

Ying Yan, Yanjie Gao, Yang Chen, Zhongxin Guo, Bole Chen, T. Moscibroda


Citations: 94

TR-Spark: Transient Computing for Big Data Analytics

Large-scale public cloud providers invest billions of dollars in their cloud infrastructure and operate hundreds of thousands of servers across the globe. For various reasons, much of this provisioned server capacity runs at low average utilization, and there is tremendous competitive pressure to increase utilization. Conceptually, the way to increase utilization is clear: run time-insensitive batch-job workloads as secondary background tasks whenever server capacity is underutilized, and evict these workloads when the server's primary task requires more resources. Big data analytics tasks would seem to be an ideal fit to run opportunistically on such transient resources in the cloud. In reality, however, modern distributed data processing systems such as MapReduce or Spark are designed to run as the primary task on dedicated hardware, and they perform badly on transiently available resources because of the excessive cost of cascading re-computations in case of evictions. In this paper, we propose a new framework for big data analytics on transient resources. Specifically, we design and implement TR-Spark, a version of Spark that can run highly efficiently as a secondary background task on transient (evictable) resources. The design of TR-Spark is based on two principles: scheduling that is aware of resource stability and data size reduction, and lineage-aware checkpointing. The combination of these principles allows TR-Spark to naturally adapt to the stability characteristics of the underlying compute infrastructure. Evaluation results show that while regular Spark effectively fails to finish a job in clusters of even moderate instability, TR-Spark performs nearly as well as Spark running on stable resources.
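To make the cost of cascading re-computation concrete, the following is a minimal toy model and not TR-Spark's actual algorithm. It assumes a linear lineage of stages, a durable job input, and that an eviction wipes every intermediate result that was not checkpointed. The function name `recovery_cost`, the stage costs, and the choice of checkpointed stage are all hypothetical and chosen only for illustration.

```python
from typing import Sequence, Set


def recovery_cost(stage_costs: Sequence[float],
                  checkpointed: Set[int],
                  target: int) -> float:
    """Cost to rebuild the output of stage `target` after an eviction.

    Toy assumptions: stages form a linear chain (stage i reads the output
    of stage i - 1), the job input is durable, and only outputs listed in
    `checkpointed` survive the eviction.
    """
    cost = 0.0
    stage = target
    # Walk the lineage backwards until a surviving (checkpointed) output
    # or the durable job input is reached, summing the stages to re-run.
    while stage >= 0 and stage not in checkpointed:
        cost += stage_costs[stage]
        stage -= 1
    return cost


# Hypothetical per-stage compute costs for a four-stage job.
costs = [4.0, 3.0, 2.0, 5.0]

# Without checkpoints, losing the last output re-runs the whole chain.
print(recovery_cost(costs, set(), 3))   # 14.0

# With the output of stage 1 checkpointed, only stages 2 and 3 re-run.
print(recovery_cost(costs, {1}, 3))     # 7.0
```

In this sketch a single checkpoint halves the recovery cost. A real system also has to weigh the cost of writing each checkpoint against the likelihood of eviction, a trade-off this toy model does not capture.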
