V. Borkar, Yingyi Bu, E. Carman, Nicola Onose, T. Westmann, Pouria Pirzadeh, M. Carey, V. Tsotras
{"title":"Algebricks: a data model-agnostic compiler backend for big data languages","authors":"V. Borkar, Yingyi Bu, E. Carman, Nicola Onose, T. Westmann, Pouria Pirzadeh, M. Carey, V. Tsotras","doi":"10.1145/2806777.2806941","DOIUrl":null,"url":null,"abstract":"A number of high-level query languages, such as Hive, Pig, Flume, and Jaql, have been developed in recent years to increase analyst productivity when processing and analyzing very large datasets. The implementation of each of these languages includes a complete, data model-dependent query compiler, yet each involves a number of similar optimizations. In this work, we describe a new query compiler architecture that separates language-specific and data model-dependent aspects from a more general query compiler backend that can generate executable data-parallel programs for shared-nothing clusters and can be used to develop multiple languages with different data models. We have built such a data model-agnostic query compiler substrate, called Algebricks, and have used it to implement three different query languages --- HiveQL, AQL, and XQuery --- to validate the efficacy of this approach. Experiments show that all three query languages benefit from the parallelization and optimization that Algebricks provides and thus have good parallel speedup and scaleup characteristics for large datasets.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixth ACM Symposium on Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2806777.2806941","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 26
Abstract
A number of high-level query languages, such as Hive, Pig, Flume, and Jaql, have been developed in recent years to increase analyst productivity when processing and analyzing very large datasets. The implementation of each of these languages includes a complete, data model-dependent query compiler, yet each involves a number of similar optimizations. In this work, we describe a new query compiler architecture that separates language-specific and data model-dependent aspects from a more general query compiler backend that can generate executable data-parallel programs for shared-nothing clusters and can be used to develop multiple languages with different data models. We have built such a data model-agnostic query compiler substrate, called Algebricks, and have used it to implement three different query languages --- HiveQL, AQL, and XQuery --- to validate the efficacy of this approach. Experiments show that all three query languages benefit from the parallelization and optimization that Algebricks provides and thus have good parallel speedup and scaleup characteristics for large datasets.