Musketeer: all for one, one for all in data processing systems

Proceedings of the Tenth European Conference on Computer Systems. Published: 2015-04-17. DOI: 10.1145/2741948.2741968

Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, S. Hand


Citations: 108

Abstract


Musketeer: all for one, one for all in data processing systems

Many systems for the parallel processing of big data are available today. Yet, few users can tell by intuition which system, or combination of systems, is "best" for a given workflow. Porting workflows between systems is tedious. Hence, users become "locked in", despite faster or more efficient systems being available. This is a direct consequence of the tight coupling between user-facing front-ends that express workflows (e.g., Hive, SparkSQL, Lindi, GraphLINQ) and the back-end execution engines that run them (e.g., MapReduce, Spark, PowerGraph, Naiad). We argue that the ways that workflows are defined should be decoupled from the manner in which they are executed. To explore this idea, we have built Musketeer, a workflow manager which can dynamically map front-end workflow descriptions to a broad range of back-end execution engines. Our prototype maps workflows expressed in four high-level query languages to seven different popular data processing systems. Musketeer speeds up realistic workflows by up to 9x by targeting different execution engines, without requiring any manual effort. Its automatically generated back-end code comes within 5%–30% of the performance of hand-optimized implementations.
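The decoupling argued for in the abstract can be illustrated with a small sketch. It is not Musketeer's code or API: the `Operator` type, the engine names `engine_a` and `engine_b`, and every cost figure are hypothetical, chosen only to show how a workflow expressed once in a back-end-neutral form might be mapped to whichever engine a simple cost estimate favours.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Operator:
    """One workflow step in a back-end-neutral form (hypothetical IR)."""
    kind: str        # e.g. "SELECT", "JOIN", "AGG"
    input_rows: int  # assumed estimate of the operator's input size


# Hypothetical per-row cost of each operator kind on each back-end.
# The numbers are invented for illustration; they are not measurements.
COST_PER_ROW: Dict[str, Dict[str, float]] = {
    "engine_a": {"SELECT": 1.0, "JOIN": 4.0, "AGG": 2.0},
    "engine_b": {"SELECT": 2.0, "JOIN": 1.5, "AGG": 1.0},
}

# Hypothetical fixed start-up cost of running one job on each back-end.
STARTUP_COST: Dict[str, float] = {"engine_a": 100.0, "engine_b": 5000.0}


def estimate_cost(workflow: List[Operator], engine: str) -> float:
    """Start-up cost plus the summed per-row cost of every operator."""
    per_row = COST_PER_ROW[engine]
    return STARTUP_COST[engine] + sum(
        per_row[op.kind] * op.input_rows for op in workflow
    )


def choose_engine(workflow: List[Operator]) -> str:
    """Pick the back-end with the lowest estimated cost (ties: by name)."""
    return min(sorted(COST_PER_ROW), key=lambda e: estimate_cost(workflow, e))


def generate_job(workflow: List[Operator], engine: str) -> str:
    """Stand-in for code generation: describe the job for the chosen engine."""
    steps = " -> ".join(op.kind for op in workflow)
    return f"[{engine}] {steps}"


if __name__ == "__main__":
    small = [Operator("SELECT", 100), Operator("AGG", 100)]
    large = [
        Operator("SELECT", 10_000),
        Operator("JOIN", 10_000),
        Operator("AGG", 5_000),
    ]
    for wf in (small, large):
        print(generate_job(wf, choose_engine(wf)))
```

Under these made-up costs the small workflow is sent to the engine with the low start-up cost and the large one to the engine with cheaper per-row processing, even though the workflow definition is written the same way in both cases. A real system would need a far richer cost model and actual code generation for each back-end.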
