FlashR: parallelize and scale R for machine learning using SSDs

Da Zheng, Disa Mhembere, J. Vogelstein, C. Priebe, R. Burns

Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2018), February 10, 2018. DOI: 10.1145/3178487.3178501
R is one of the most popular programming languages for statistics and machine learning, but it is slow and does not scale to large datasets. The usual way to make an R algorithm efficient is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR speeds up parallelized R code with memory-hierarchy-aware execution: it (i) evaluates matrix operations lazily, (ii) executes all operations in a DAG in a single execution, with only one pass over the data, to increase the ratio of computation to I/O, and (iii) applies two levels of matrix partitioning and reorders computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on a range of machine learning and statistics algorithms with inputs of up to four billion data points. Despite the large performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3 to 20.
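The central claim is that FlashR accelerates ordinary base-R matrix code without rewriting it. As a minimal sketch of that programming model, the following uses plain base R (FlashR-specific constructors for out-of-core matrices are omitted, since their names are not given in the abstract); each matrix operation below is the kind of base function that FlashR parallelizes and, via lazy evaluation, fuses into a DAG:

```r
# Plain base-R matrix code: one gradient-descent step for least squares.
# Under FlashR, these base matrix functions are overridden, so each call
# is recorded lazily into a DAG instead of being computed eagerly.
n <- 1e5; p <- 10
X <- matrix(runif(n * p), n, p)  # design matrix (FlashR would back this with SSDs)
y <- runif(n)                    # response vector

beta <- rep(0, p)
# Two matrix products and elementwise ops; per the abstract, FlashR would
# evaluate this whole expression DAG in a single execution with one pass
# over X, rather than materializing each intermediate result.
grad <- crossprod(X, X %*% beta - y) / n
beta <- beta - 0.01 * grad
```

Because the products and elementwise operations form one DAG evaluated with a single scan of X, the computation-to-I/O ratio stays high, which is what lets the SSD configuration track the in-memory one.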