SparkR: Scaling R Programs with Spark

Proceedings of the 2016 International Conference on Management of Data
Pub Date: 2016-06-14
DOI: 10.1145/2882903.2903740

S. Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, H. Falaki, Xiangrui Meng, Reynold Xin, A. Ghodsi, M. Franklin, I. Stoica, M. Zaharia

Citations: 67

Abstract

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited because the R runtime is single-threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large-scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation, and present some of the key details of our implementation.
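To make the scalability claim concrete, the following is a minimal conceptual sketch, in plain Python rather than R or Spark, of why a declarative grouped aggregation can be spread over partitions: each partition produces a small partial result, and only those partials are merged. It is not SparkR's implementation and uses no Spark API; the function names (`split_into_partitions`, `partial_aggregate`, `merge_partials`, `grouped_mean`) and the use of local threads as stand-ins for cluster workers are assumptions made purely for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_partitions):
    """Deal rows round-robin into n_partitions lists (stand-ins for cluster partitions)."""
    return [rows[i::n_partitions] for i in range(n_partitions)]


def partial_aggregate(partition):
    """Compute per-key (sum, count) using only the rows of one partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: tuple(pair) for key, pair in acc.items()}


def merge_partials(partials):
    """Combine the small per-partition summaries into one per-key (sum, count)."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return merged


def grouped_mean(rows, n_partitions=4):
    """Grouped mean over (key, value) rows, computed partition by partition."""
    partitions = split_into_partitions(rows, n_partitions)
    # Hypothetical local workers; a real engine would ship this work to executors.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    merged = merge_partials(partials)
    return {key: total / count for key, (total, count) in merged.items()}
```

Because the mean is rebuilt from per-partition sums and counts, no single worker ever needs the whole data set in memory, which is the single-machine limitation the abstract describes for the R runtime.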
