Lamda-Flow: Automatic Pushdown of Dataflow Operators Close to the Data

Raúl Gracia Tinedo, Marc Sánchez Artigas, P. López, Y. Moatti, Filip Gluszak
DOI: 10.1109/CCGRID.2019.00022
Published in: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Publication date: 2019-05-14
Citations: 3

Abstract

Modern data analytics infrastructures are composed of physically disaggregated compute and storage clusters. Thus, dataflow analytics engines, such as Apache Spark or Flink, are left with no choice but to transfer datasets to the compute cluster prior to their actual processing. For large data volumes, this becomes problematic, since it involves massive data transfers that exhaust network bandwidth, that waste compute cluster memory, and that may become a performance barrier. To overcome this problem, we present λFlow: a framework for automatically pushing dataflow operators (e.g., map, flatMap, filter, etc.) down onto the storage layer. The novelty of λFlow is that it manages the pushdown granularity at the operator level, which makes it a unique problem. To wit, it requires addressing several challenges, such as how to encapsulate dataflow operators and execute them on the storage cluster, and how to keep track of dependencies such that operators can be pushed down safely onto the storage layer. Our evaluation reports significant reductions in resource usage for a large variety of IO-bound jobs. For instance, λFlow was able to reduce both network bandwidth and memory requirements by 90% in Spark. Our Flink experiments also prove the extensibility of λFlow to other engines.
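To make the idea of operator-level pushdown concrete, the following is a minimal illustrative sketch (not λFlow's actual API; the `StorageNode` class and `execute` method are hypothetical names): a storage node applies a chain of record-at-a-time operators (map, flatMap, filter) locally, so only the reduced result set crosses the network to the compute cluster.

```python
from typing import Any, Callable, Iterable, List, Tuple

class StorageNode:
    """Hypothetical storage-side executor for pushed-down operators."""

    def __init__(self, records: List[Any]):
        self.records = records

    def execute(self, operators: List[Tuple[str, Callable]]) -> List[Any]:
        """Apply a chain of (kind, fn) dataflow operators close to the data.

        Only record-at-a-time operators are pushdown-safe here: each one
        needs a single input record, so it can run where the data lives.
        """
        out: Iterable[Any] = self.records
        for kind, fn in operators:
            if kind == "map":
                out = map(fn, out)
            elif kind == "filter":
                out = filter(fn, out)
            elif kind == "flatMap":
                out = (y for x in out for y in fn(x))
            else:
                # Stateful/wide operators (e.g., reduceByKey) need data
                # from many partitions and are not pushed down.
                raise ValueError(f"operator {kind!r} is not pushdown-safe")
        # Only this (typically much smaller) result is transferred.
        return list(out)

# Push a filter + map chain down to the storage node: of 1000 records,
# only the 10 surviving records travel to the compute cluster.
node = StorageNode(list(range(1000)))
shipped = node.execute([
    ("filter", lambda x: x % 100 == 0),
    ("map", lambda x: x * 2),
])
```

The sketch mirrors the reported benefit: an IO-bound job whose selective operators run storage-side ships a fraction of the raw dataset, which is where the 90% bandwidth and memory reductions come from.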