SparkR: Scaling R Programs with Spark

S. Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, H. Falaki, Xiangrui Meng, Reynold Xin, A. Ghodsi, M. Franklin, I. Stoica, M. Zaharia
{"title":"SparkR:用Spark扩展R程序","authors":"S. Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, H. Falaki, Xiangrui Meng, Reynold Xin, A. Ghodsi, M. Franklin, I. Stoica, M. Zaharia","doi":"10.1145/2882903.2903740","DOIUrl":null,"url":null,"abstract":"R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"70 8 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"67","resultStr":"{\"title\":\"SparkR: Scaling R Programs with Spark\",\"authors\":\"S. Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, H. Falaki, Xiangrui Meng, Reynold Xin, A. Ghodsi, M. Franklin, I. Stoica, M. Zaharia\",\"doi\":\"10.1145/2882903.2903740\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.\",\"PeriodicalId\":20483,\"journal\":{\"name\":\"Proceedings of the 2016 International Conference on Management of Data\",\"volume\":\"70 8 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"67\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2016 International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2882903.2903740\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2882903.2903740","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 67

Abstract

R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.
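To make the DataFrame API concrete, below is a minimal sketch of how an analysis might look from the R shell. It assumes a local Spark 2.x installation with the SparkR package on the library path; the session settings, column names, and the use of R's built-in `faithful` dataset are illustrative choices, not taken from the paper.

```r
# Minimal sketch of large-scale analysis from the R shell via SparkR.
# Assumes Spark 2.x in local mode; the dataset and app name are illustrative.
library(SparkR)

# Connect the R shell to a Spark session (local mode here)
sparkR.session(master = "local[*]", appName = "SparkRSketch")

# Promote a local R data.frame to a distributed Spark DataFrame
df <- createDataFrame(faithful)

# Build a distributed aggregation lazily, then collect the small result back to R
counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(collect(counts))

# Relational filters and projections are likewise executed by Spark's engine
long_eruptions <- filter(df, df$eruptions > 3.0)
head(select(long_eruptions, "eruptions", "waiting"))

sparkR.session.stop()
```

Only the small aggregated result is brought back into the R process with `collect`; the bulk of the computation runs on Spark's distributed engine, which is what lets the same R code scale beyond a single machine's memory.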