Distributed aggregation for data-parallel computing: interfaces and implementations

Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. Pub Date: 2009-10-11. DOI: 10.1145/1629575.1629600

Yuan Yu, P. Gunda, M. Isard

Pages: 247-260

Citations: 197

Abstract

Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations and the efficiency of their implementation are of great current interest. This paper evaluates the interfaces and implementations for user-defined aggregation in several state-of-the-art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; that the choice of programming interface has a material effect on the performance of computations; that some execution plans perform better than others on average; and that, in order to get good performance on a variety of workloads, a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.
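As a rough illustration of what a user-defined grouped aggregation can look like, the sketch below computes a per-key average by splitting the aggregation into three user-supplied functions, so that each partition can be reduced locally before the partial results are merged. This is only one possible interface shape, written in plain Python under stated assumptions; it is not the API of Hadoop, Oracle Parallel Server, or DryadLINQ, and the names `initial`, `combine`, `final`, `partial_aggregate`, and `grouped_average` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Record = Tuple[Hashable, float]   # (group key, value)
Partial = Tuple[float, int]       # (running sum, count) for one group


def initial(value: float) -> Partial:
    """Turn a single input value into a partial aggregate."""
    return (value, 1)


def combine(a: Partial, b: Partial) -> Partial:
    """Merge two partial aggregates; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


def final(p: Partial) -> float:
    """Turn a fully merged partial aggregate into the result (the mean)."""
    return p[0] / p[1]


def partial_aggregate(partition: Iterable[Record]) -> Dict[Hashable, Partial]:
    """Aggregate one partition locally, as a worker could do before any
    data is exchanged between machines (hypothetical helper)."""
    acc: Dict[Hashable, Partial] = {}
    for key, value in partition:
        acc[key] = combine(acc[key], initial(value)) if key in acc else initial(value)
    return acc


def grouped_average(partitions: List[List[Record]]) -> Dict[Hashable, float]:
    """Merge the per-partition partial aggregates, then finalize each group."""
    merged: Dict[Hashable, Partial] = {}
    for partition in partitions:
        for key, part in partial_aggregate(partition).items():
            merged[key] = combine(merged[key], part) if key in merged else part
    return {key: final(part) for key, part in merged.items()}


if __name__ == "__main__":
    # Three lists stand in for data held on three different machines.
    partitions = [
        [("a", 1.0), ("b", 4.0)],
        [("a", 3.0)],
        [("b", 6.0), ("a", 5.0)],
    ]
    print(grouped_average(partitions))
```

In this sketch only one small (sum, count) pair per key leaves each partition, rather than every raw record. Whether such local pre-aggregation pays off depends on the computation and the data, which is consistent with the abstract's point that a system needs to choose between execution plans.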


