Authors: Xia Xie, Jie Cao, Hai Jin, Xijiang Ke, Wenzhi Cao
Published in: 2012 IEEE Asia-Pacific Services Computing Conference, 2012-12-06
DOI: 10.1109/APSCC.2012.74 (https://doi.org/10.1109/APSCC.2012.74)
JRBridge: A Framework of Large-Scale Statistical Computing for R
Demand for highly scalable parallel data processing platforms is rising due to an explosion in the number of massive-scale, data-intensive applications in both industry and the sciences. Performing statistical computing over huge data repositories poses a significant challenge to existing statistical software and computational infrastructure. An analysis of various open-source computational infrastructures and their programming-paradigm APIs shows that most of them are JVM-based and that their APIs are exposed as Java interfaces or abstract classes. This paper proposes a generic framework, JRBridge, which integrates R with JVM-based computational infrastructures by automatically generating Java API wrapper code around native R code and handling type conversion. Using this framework, we build a distributed statistical computing environment by integrating R with Hadoop. The Hadoop Distributed File System plug-in provides a way to store and access datasets with millions of objects. The MapReduce plug-in provides a natural environment for coding MapReduce algorithms in R. Experimental results show that JRBridge scales linearly with the size of the datasets and thus provides a scalable solution for large-scale statistical computing in R.
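The abstract does not show JRBridge's API, so the following is only a rough, in-process sketch of the kind of MapReduce-style statistical computation such a framework targets: per-group mean and variance computed from mergeable partial summaries. It is written in Python rather than R, does not use Hadoop, and all function names (`map_record`, `combine`, `run_mapreduce`, `finalize`) are hypothetical choices for illustration, not part of the paper.

```python
from collections import defaultdict
from functools import reduce


def map_record(record):
    """Map step: emit (key, (count, sum, sum of squares)) for one record."""
    group, value = record
    return group, (1, value, value * value)


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return tuple(x + y for x, y in zip(a, b))


def run_mapreduce(records):
    """Group mapped pairs by key (the 'shuffle'), then reduce each group."""
    shuffled = defaultdict(list)
    for record in records:
        key, summary = map_record(record)
        shuffled[key].append(summary)
    return {key: reduce(combine, parts) for key, parts in shuffled.items()}


def finalize(summary):
    """Turn (n, sum, sum_sq) into (mean, population variance)."""
    n, total, total_sq = summary
    mean = total / n
    return mean, total_sq / n - mean * mean


if __name__ == "__main__":
    # Hypothetical toy data: (group label, observed value) pairs.
    data = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 4.0), ("b", 6.0)]
    for key, summary in sorted(run_mapreduce(data).items()):
        print(key, finalize(summary))
```

Because the summaries merge associatively, the reduce step can in principle run on partial results from separate workers, which is what makes this style of statistic a natural fit for a MapReduce backend. The sum-of-squares formula for variance is used here for brevity; it can lose precision on data with a large mean and small spread.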