Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, S. Hand
{"title":"火枪手:在数据处理系统中,人人为我,我为人人","authors":"Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, S. Hand","doi":"10.1145/2741948.2741968","DOIUrl":null,"url":null,"abstract":"Many systems for the parallel processing of big data are available today. Yet, few users can tell by intuition which system, or combination of systems, is \"best\" for a given workflow. Porting workflows between systems is tedious. Hence, users become \"locked in\", despite faster or more efficient systems being available. This is a direct consequence of the tight coupling between user-facing front-ends that express workflows (e.g., Hive, SparkSQL, Lindi, GraphLINQ) and the back-end execution engines that run them (e.g., MapReduce, Spark, PowerGraph, Naiad). We argue that the ways that workflows are defined should be decoupled from the manner in which they are executed. To explore this idea, we have built Musketeer, a workflow manager which can dynamically map front-end workflow descriptions to a broad range of back-end execution engines. Our prototype maps workflows expressed in four high-level query languages to seven different popular data processing systems. Musketeer speeds up realistic workflows by up to 9x by targeting different execution engines, without requiring any manual effort. Its automatically generated back-end code comes within 5%--30% of the performance of hand-optimized implementations.","PeriodicalId":119291,"journal":{"name":"Proceedings of the Tenth European Conference on Computer Systems","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"108","resultStr":"{\"title\":\"Musketeer: all for one, one for all in data processing systems\",\"authors\":\"Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, S. Hand\",\"doi\":\"10.1145/2741948.2741968\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many systems for the parallel processing of big data are available today. Yet, few users can tell by intuition which system, or combination of systems, is \\\"best\\\" for a given workflow. Porting workflows between systems is tedious. Hence, users become \\\"locked in\\\", despite faster or more efficient systems being available. This is a direct consequence of the tight coupling between user-facing front-ends that express workflows (e.g., Hive, SparkSQL, Lindi, GraphLINQ) and the back-end execution engines that run them (e.g., MapReduce, Spark, PowerGraph, Naiad). We argue that the ways that workflows are defined should be decoupled from the manner in which they are executed. To explore this idea, we have built Musketeer, a workflow manager which can dynamically map front-end workflow descriptions to a broad range of back-end execution engines. Our prototype maps workflows expressed in four high-level query languages to seven different popular data processing systems. Musketeer speeds up realistic workflows by up to 9x by targeting different execution engines, without requiring any manual effort. Its automatically generated back-end code comes within 5%--30% of the performance of hand-optimized implementations.\",\"PeriodicalId\":119291,\"journal\":{\"name\":\"Proceedings of the Tenth European Conference on Computer Systems\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-04-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"108\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Tenth European Conference on Computer Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2741948.2741968\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Tenth European Conference on Computer Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2741948.2741968","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Musketeer: all for one, one for all in data processing systems
Many systems for the parallel processing of big data are available today. Yet, few users can tell by intuition which system, or combination of systems, is "best" for a given workflow. Porting workflows between systems is tedious. Hence, users become "locked in", despite faster or more efficient systems being available. This is a direct consequence of the tight coupling between user-facing front-ends that express workflows (e.g., Hive, SparkSQL, Lindi, GraphLINQ) and the back-end execution engines that run them (e.g., MapReduce, Spark, PowerGraph, Naiad). We argue that the ways that workflows are defined should be decoupled from the manner in which they are executed. To explore this idea, we have built Musketeer, a workflow manager which can dynamically map front-end workflow descriptions to a broad range of back-end execution engines. Our prototype maps workflows expressed in four high-level query languages to seven different popular data processing systems. Musketeer speeds up realistic workflows by up to 9x by targeting different execution engines, without requiring any manual effort. Its automatically generated back-end code comes within 5%--30% of the performance of hand-optimized implementations.