Lamda-Flow: Automatic Pushdown of Dataflow Operators Close to the Data

Raúl Gracia Tinedo, Marc Sánchez Artigas, P. López, Y. Moatti, Filip Gluszak
DOI: 10.1109/CCGRID.2019.00022
Published in: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
Publication date: 2019-05-14
Citations: 3

Abstract

Modern data analytics infrastructures are composed of physically disaggregated compute and storage clusters. Thus, dataflow analytics engines, such as Apache Spark or Flink, are left with no choice but to transfer datasets to the compute cluster prior to their actual processing. For large data volumes, this becomes problematic, since it involves massive data transfers that exhaust network bandwidth, that waste compute cluster memory, and that may become a performance barrier. To overcome this problem, we present λFlow: a framework for automatically pushing dataflow operators (e.g., map, flatMap, filter, etc.) down onto the storage layer. The novelty of λFlow is that it manages the pushdown granularity at the operator level, which makes it a unique problem. To wit, it requires addressing several challenges, such as how to encapsulate dataflow operators and execute them on the storage cluster, and how to keep track of dependencies such that operators can be pushed down safely onto the storage layer. Our evaluation reports significant reductions in resource usage for a large variety of IO-bound jobs. For instance, λFlow was able to reduce both network bandwidth and memory requirements by 90% in Spark. Our Flink experiments also prove the extensibility of λFlow to other engines.
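To make the idea of operator-level pushdown concrete, the following is a minimal illustrative sketch (not λFlow's actual API; the `StorageNode` class and `execute` method are hypothetical names): a storage node applies a chain of record-at-a-time operators (map, flatMap, filter) locally, so only the reduced result set crosses the network to the compute cluster.

```python
from typing import Any, Callable, Iterable, List, Tuple

class StorageNode:
    """Hypothetical storage-side executor for pushed-down operators."""

    def __init__(self, records: List[Any]):
        self.records = records

    def execute(self, operators: List[Tuple[str, Callable]]) -> List[Any]:
        """Apply a chain of (kind, fn) dataflow operators close to the data.

        Only record-at-a-time operators are pushdown-safe here: each one
        needs a single input record, so it can run where the data lives.
        """
        out: Iterable[Any] = self.records
        for kind, fn in operators:
            if kind == "map":
                out = map(fn, out)
            elif kind == "filter":
                out = filter(fn, out)
            elif kind == "flatMap":
                out = (y for x in out for y in fn(x))
            else:
                # Stateful/wide operators (e.g., reduceByKey) need data
                # from many partitions and are not pushed down.
                raise ValueError(f"operator {kind!r} is not pushdown-safe")
        # Only this (typically much smaller) result is transferred.
        return list(out)

# Push a filter + map chain down to the storage node: of 1000 records,
# only the 10 surviving records travel to the compute cluster.
node = StorageNode(list(range(1000)))
shipped = node.execute([
    ("filter", lambda x: x % 100 == 0),
    ("map", lambda x: x * 2),
])
```

The sketch mirrors the reported benefit: an IO-bound job whose selective operators run storage-side ships a fraction of the raw dataset, which is where the 90% bandwidth and memory reductions come from.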