SAGA: array storage as a DB with support for structural aggregations

Scientific and statistical database management : International Conference, SSDBM ... : proceedings. International Conference on Scientific and Statistical Database Management. Pub Date: 2014-06-30. DOI: 10.1145/2618243.2618270

Yi Wang, Arnab Nandi, G. Agrawal

Pages: 9:1-9:12. URL: https://doi.org/10.1145/2618243.2618270

Citations: 54

Abstract

In recent years, many array DBMSs, including SciDB and RasDaMan, have emerged to meet the needs of data management applications whose natural data structures are arrays. These systems, like their relational counterparts, involve an expensive data ingestion phase. The paradigm of using native storage as a DB and providing database-like support (e.g., the NoDB approach) has recently been shown to be effective for infrequently queried data, where data ingestion costs cannot be justified, though only in the context of relational data. Applications that generate massive arrays, such as scientific simulations, often store the data in one of a small number of array storage formats, like NetCDF or HDF5. Thus, a natural question is: "Can database-like functionality be supported over native array storage?" In this paper, we present algorithms, different partitioning strategies, and an analytical model for supporting structural (grid, sliding, hierarchical, and circular) aggregations over native array storage, and describe the implementation of this approach in a system we refer to as Structural AGgregations over Array storage (SAGA). We show how the relative performance of different partitioning strategies changes with the amount of computation in the aggregation function and with the level of data skew, and that our model is effective in choosing the best partitioning strategy. A performance comparison with SciDB shows that, despite working on native array storage, the aggregation costs with our system are lower. Finally, we show that our structural aggregation implementations achieve high parallel efficiency.
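To make two of the structural aggregations named in the abstract concrete, the following is a minimal sketch in plain Python, not SAGA's implementation. It assumes that a grid aggregation reduces each disjoint fixed-size tile of the array and that a sliding aggregation reduces a fixed-size window moved one cell at a time; the abstract does not define these operators, so both readings are assumptions. The function names `grid_aggregate` and `sliding_aggregate` are hypothetical, and the sketch operates on an in-memory list of lists rather than on a NetCDF or HDF5 file.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]
Agg = Callable[[Sequence[float]], float]


def grid_aggregate(a: Matrix, tile_rows: int, tile_cols: int, agg: Agg = sum) -> Matrix:
    """Assumed semantics: reduce each disjoint tile_rows x tile_cols tile of `a`."""
    rows, cols = len(a), len(a[0])
    out: Matrix = []
    for r0 in range(0, rows, tile_rows):
        out_row = []
        for c0 in range(0, cols, tile_cols):
            # Edge tiles are clipped to the array bounds.
            cells = [a[r][c]
                     for r in range(r0, min(r0 + tile_rows, rows))
                     for c in range(c0, min(c0 + tile_cols, cols))]
            out_row.append(agg(cells))
        out.append(out_row)
    return out


def sliding_aggregate(a: Matrix, win_rows: int, win_cols: int, agg: Agg = sum) -> Matrix:
    """Assumed semantics: reduce every fully contained window, stepping one cell at a time."""
    rows, cols = len(a), len(a[0])
    out: Matrix = []
    for r0 in range(rows - win_rows + 1):
        out_row = []
        for c0 in range(cols - win_cols + 1):
            cells = [a[r][c]
                     for r in range(r0, r0 + win_rows)
                     for c in range(c0, c0 + win_cols)]
            out_row.append(agg(cells))
        out.append(out_row)
    return out


data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]

grid_sums = grid_aggregate(data, 2, 2)        # one value per 2x2 tile
window_sums = sliding_aggregate(data, 3, 3)   # one value per 3x3 window position
```

Under these assumptions, grid tiles are disjoint while neighbouring sliding windows overlap and share most of their cells, so the two operators split across workers in different ways.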
