Presto: distributed machine learning and graph processing with sparse matrices

Proceedings of the Eleventh European Conference on Computer Systems · Pub Date: 2013-04-15 · DOI: 10.1145/2465351.2465371

S. Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, R. Schreiber

Citations: 100

Abstract

It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks. In this paper we show that array-based languages such as R [3] are suitable for implementing complex algorithms and can outperform current data-parallel solutions. Since R is single-threaded and does not scale to large datasets, we have built Presto, a distributed system that extends R and addresses many of its limitations. Presto efficiently shares sparse structured data, can leverage multi-cores, and dynamically partitions data to mitigate load imbalance. Our results show the promise of this approach: many important machine learning and graph algorithms can be expressed in a single framework and are substantially faster than those in Hadoop and Spark.
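To make concrete the observation that graph algorithms reduce to matrix computations, the sketch below expresses PageRank as repeated sparse matrix-vector products. It is a single-machine illustration in plain Python and is not Presto's API or the authors' implementation. The function names (`sparse_matvec`, `pagerank`), the dictionary-based sparse format, and the default damping factor of 0.85 are all assumptions chosen for the example.

```python
def sparse_matvec(entries, vector):
    """Multiply a sparse matrix, stored as {(row, col): value}, by a dense vector."""
    result = [0.0] * len(vector)
    for (row, col), value in entries.items():
        result[row] += value * vector[col]
    return result


def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """PageRank by power iteration over a sparse transition matrix.

    edges is an iterable of (src, dst) pairs with node ids in [0, num_nodes).
    This is an illustrative sketch, not the method used in the paper.
    """
    edges = set(edges)  # ignore duplicate edges

    out_degree = [0] * num_nodes
    for src, _dst in edges:
        out_degree[src] += 1

    # Column-stochastic transition matrix: entry (dst, src) = 1 / outdeg(src).
    transition = {(dst, src): 1.0 / out_degree[src] for src, dst in edges}

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread uniformly.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        spread = sparse_matvec(transition, rank)
        rank = [
            (1.0 - damping) / num_nodes + damping * (s + dangling / num_nodes)
            for s in spread
        ]
    return rank


if __name__ == "__main__":
    # Nodes 0 and 1 both link to node 2, which links back to node 0.
    print(pagerank([(0, 2), (1, 2), (2, 0)], num_nodes=3))
```

Only the nonzero entries of the transition matrix are stored and visited, which is why sparsity matters for large graphs: the per-iteration cost is proportional to the number of edges rather than to the square of the number of nodes.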
