SciHive: Array-Based Query Processing with HiveQL
Yifeng Geng, Xiaomeng Huang, Meiqi Zhu, Huabin Ruan, Guangwen Yang
2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications
Publication date: 2013-07-16
DOI: 10.1109/TrustCom.2013.108 (https://doi.org/10.1109/TrustCom.2013.108)
Citation count: 20
Abstract
Data-intensive scientific discovery is generating huge amounts of data at an alarming rate. Most of these data are multidimensional and stored in array-based file formats, and processing them has become an urgent challenge. In this paper, we present SciHive, a scalable and easy-to-use array-based query system. SciHive enables scientists to process raw array datasets in parallel with a SQL-like query language. We implement SciHive as an extension of Hive, a data warehouse system built on Hadoop. SciHive maps the arrays in NetCDF files to a table and executes queries via MapReduce. Files are loaded dynamically as needed, so SciHive requires no additional pre-loading or format conversion step. In addition, SciHive includes two optimization methods that reduce the number of rows generated. Experiments with different queries on representative datasets show that the optimizations are very effective in most cases and that SciHive scales to large datasets.
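The mapping from an array to a table described above can be illustrated with a minimal sketch. It is not SciHive's implementation: the function `array_to_rows`, the sample `temperature` variable, and its (lat, lon) dimensions are hypothetical names chosen for illustration, and the array is a plain nested list rather than a NetCDF variable. The sketch only shows the general idea that each array cell can be viewed as one row made of its dimension indices plus its value, after which a SQL-like predicate becomes a filter over rows.

```python
from itertools import product


def array_to_rows(array, shape):
    """Yield one (i0, i1, ..., value) row per cell of a nested-list array."""
    for index in product(*(range(n) for n in shape)):
        cell = array
        for i in index:
            cell = cell[i]  # descend one dimension at a time
        yield (*index, cell)


# Hypothetical 2 x 3 variable, e.g. temperature over (lat, lon).
temperature = [
    [280.5, 281.0, 279.8],
    [282.1, 283.4, 281.9],
]

rows = list(array_to_rows(temperature, (2, 3)))

# Rough analogue of: SELECT lat, lon, value FROM t WHERE value > 282
warm = [row for row in rows if row[2] > 282]

for row in warm:
    print(row)
```

Note that this naive version generates a row for every cell before filtering, so the row count equals the product of the dimension sizes. That cost is presumably why reducing the number of generated rows matters for large arrays.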