Q1 Computer Science

ACM Sigplan Notices. Pub Date: 2018-02-10. DOI: 10.1145/3200691.3178501

Da Zheng, Disa Mhembere, J. Vogelstein, Carey E. Priebe, R. Burns

Pages: 183 - 194. Publication type: Journal Article. Open access: no. Source: Semantic Scholar. https://doi.org/10.1145/3200691.3178501

Citations: 0

FlashR

R is one of the most popular programming languages for statistics and machine learning, but it is slow and does not scale to large datasets. The usual way to obtain an efficient algorithm in R is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR performs memory-hierarchy-aware execution to speed up parallelized R code by (i) evaluating matrix operations lazily, (ii) performing all operations in a DAG in a single execution, with only one pass over the data, to increase the ratio of computation to I/O, and (iii) performing two levels of matrix partitioning and reordering computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on various machine learning and statistics algorithms on inputs of up to four billion data points. Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3 to 20.
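
The lazy-evaluation and single-pass ideas in the abstract can be illustrated with a toy sketch. This is not FlashR's API or implementation: it is written in Python rather than R, and `LazyVec`, `make_source`, and the `counter` dictionary are hypothetical names introduced only for illustration. The sketch assumes the data arrives in partitions (chunks), records element-wise operations instead of running them, and applies the whole recorded chain to each partition during one pass when a result is finally requested.

```python
from typing import Callable, Iterator, List, Tuple


class LazyVec:
    """Hypothetical lazy vector: records element-wise operations instead of running them."""

    def __init__(self, source: Callable[[], Iterator[List[float]]],
                 ops: Tuple[Callable[[float], float], ...] = ()):
        self._source = source  # calling it starts one pass over the partitions
        self._ops = ops        # pending element-wise functions, in application order

    def map(self, fn: Callable[[float], float]) -> "LazyVec":
        # Lazy step: extend the operation chain; no data is read here.
        return LazyVec(self._source, self._ops + (fn,))

    def __add__(self, c: float) -> "LazyVec":
        return self.map(lambda x: x + c)

    def __mul__(self, c: float) -> "LazyVec":
        return self.map(lambda x: x * c)

    def sum(self) -> float:
        # Materialization point: a single pass over the partitions, applying
        # every pending operation to each element while its chunk is in memory.
        total = 0.0
        for chunk in self._source():
            for x in chunk:
                for fn in self._ops:
                    x = fn(x)
                total += x
        return total


def make_source(data: List[float], chunk_size: int, counter: dict):
    """Hypothetical partitioned data source that counts how many passes are made."""
    def source() -> Iterator[List[float]]:
        counter["passes"] += 1
        for i in range(0, len(data), chunk_size):
            yield data[i:i + chunk_size]
    return source


counter = {"passes": 0}
v = LazyVec(make_source(list(range(10)), 4, counter))

# Building the expression touches no data.
expr = (v * 2 + 1).map(lambda x: x * x)
passes_before = counter["passes"]

# Requesting the sum runs the whole chain in one pass over the chunks.
result = expr.sum()
```

Here three operations (scale, shift, square) and the final reduction share one traversal of the data, which is the sense in which fusing a chain of operations raises the ratio of computation to I/O. An eager design would instead read and write a full intermediate vector after each step.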

Source journal

ACM Sigplan Notices (Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 4.90

Self-citation rate: 0.00%

Review time: 2-4 weeks

About the journal: The ACM Special Interest Group on Programming Languages explores programming language concepts and tools, focusing on design, implementation, practice, and theory. Its members are programming language developers, educators, implementers, researchers, theoreticians, and users. SIGPLAN sponsors several major annual conferences, including the Symposium on Principles of Programming Languages (POPL), the Symposium on Principles and Practice of Parallel Programming (PPoPP), the Conference on Programming Language Design and Implementation (PLDI), the International Conference on Functional Programming (ICFP), the International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), as well as more than a dozen other events that are either smaller or held in cooperation with other SIGs. The monthly "ACM SIGPLAN Notices" publishes proceedings of selected sponsored events and an annual report on SIGPLAN activities. Members receive discounts on conference registrations and free access to ACM SIGPLAN publications in the ACM Digital Library. SIGPLAN recognizes significant research and service contributions of individuals with a variety of awards, supports current members through the Professional Activities Committee, and encourages future programming language enthusiasts with frequent Programming Languages Mentoring Workshops (PLMW).