Sparkle: Optimizing Spark for Large Memory Machines and Analytics

Proceedings of the 2017 Symposium on Cloud Computing. Pub Date: 2017-08-18. DOI: 10.1145/3127479.3134762

Mijung Kim, Jun Yu Li, Haris Volos, M. Marwah, A. Ulanov, K. Keeton, Joseph A. Tucek, L. Cherkasova, Le Xu, Pradeep R. Fernando

Citations: 15

Abstract

Given the growing availability of affordable scale-up servers, our goal is to bring the performance benefits of in-memory processing on scale-up servers to an increasingly common class of data analytics applications: those that process small to medium-sized datasets (up to a few hundred GB) that easily fit in the memory of a typical scale-up server. To achieve this goal, we leverage Spark, an existing memory-centric data analytics framework with widespread adoption among data scientists. Bringing Spark's data analytics capabilities to a scale-up system requires rethinking the original design assumptions, which, although effective for a scale-out system, are a poor match for a scale-up system, resulting in unnecessary communication and memory inefficiencies.
