Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems

2017 International Conference on Networking, Architecture, and Storage (NAS). Pub Date: 2017-08-01. DOI: 10.1109/NAS.2017.8026864

Saman Biookaghazadeh, Shujia Zhou, Ming Zhao


Citations: 4

Abstract

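Big-data systems are increasingly important for solving data-driven problems in many science domains, including the geosciences. However, existing big-data systems cannot efficiently process self-describing data formats such as NetCDF, which scientific communities commonly use for data distribution and sharing. This limitation is a serious hurdle to the further adoption of big-data systems in science domains. This paper presents Kaleido, which addresses the problem by enabling big-data systems to efficiently store and process scientific data. Specifically, it enables Hadoop to store NetCDF data directly on HDFS and to process it in MapReduce through convenient APIs. It also enables Hive to support queries on NetCDF data, transparently to users. Moreover, it employs optimizations tailored to scientific data, in particular a dimension-aware layout that allows efficient execution of subset queries targeting any dimension of the multi-dimensional data. The paper presents a comprehensive evaluation of Kaleido using representative queries on a typical geoscientific dataset. The results show that Kaleido achieves substantial speedup and space savings compared to existing solutions for storing and processing NetCDF data on Hadoop, and that it also substantially outperforms state-of-the-art solutions for supporting subset queries on scientific data.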
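The abstract does not describe how the dimension-aware layout works internally, so the following is only a rough sketch of the general idea that chunk shape determines how much data a subset query must read. It is not Kaleido's implementation. The array shape, the two chunk shapes, and the helper names `chunks_touched` and `cells_read` are all hypothetical, chosen for illustration.

```python
from math import prod

def chunks_touched(chunk_shape, subset):
    """Count the chunks intersected by a hyper-rectangular subset query.

    chunk_shape: chunk extent along each dimension.
    subset: one half-open (start, stop) index range per dimension.
    """
    counts = []
    for extent, (start, stop) in zip(chunk_shape, subset):
        first = start // extent          # index of first chunk hit
        last = (stop - 1) // extent      # index of last chunk hit
        counts.append(last - first + 1)
    return prod(counts)

def cells_read(chunk_shape, subset):
    """Cells read if every intersected chunk must be fetched whole."""
    return chunks_touched(chunk_shape, subset) * prod(chunk_shape)

# Hypothetical (time, lat, lon) variable with 365 x 180 x 360 cells.
time_major = (1, 180, 360)   # assumed layout: one chunk per time step
balanced = (73, 36, 72)      # assumed layout: chunks split along every dimension

# Two subset queries, each selecting a single index along one dimension.
one_day = [(100, 101), (0, 180), (0, 360)]       # fix time
one_meridian = [(0, 365), (0, 180), (200, 201)]  # fix longitude

for name, layout in (("time_major", time_major), ("balanced", balanced)):
    print(name,
          cells_read(layout, one_day),
          cells_read(layout, one_meridian))
```

Under these assumptions, the time-major layout serves the single-day query from one chunk but has to read every chunk for the single-meridian query, while the balanced layout reads the same number of chunks for both. That asymmetry is the kind of cost a layout aware of all dimensions would aim to avoid.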



