FDQ: Advance Analytics Over Real Scientific Array Datasets

2018 IEEE 14th International Conference on e-Science (e-Science). Pub Date: 2018-10-01. DOI: 10.1109/eScience.2018.00134

Roee Ebenstein, G. Agrawal, Jiali Wang, J. Boley, R. Kettimuthu

Pages: 453-463

Citations: 1

Abstract

Scientific data is rapidly increasing not only in size but also in the complexity of the operations performed on it. Compared to the prevalent ad-hoc approaches, structured operators provide many benefits. In this paper, we introduce FDQ, an Analytical Functions Distributed Querying Engine intended for array data. Motivated by the needs of climate scientists in terms of both functionality and scalability, we make three major contributions. First, we introduce a new class of analytical querying: querying over windows where the planes that construct these windows are internally ordered. An example of this query type is the newly introduced MINUS analytical function, which supports querying over accumulative measurements with data resets. Second, we describe in detail memory management optimizations for efficient processing of analytical queries (and other structured operators) over large datasets. Last, we provide efficient methods to execute these queries in parallel, using a sectioned (tiled) approach. We evaluate our methods using real multi-dimensional climate datasets and show that they outperform existing approaches. When running locally (not in a distributed manner), we observed an average performance improvement of 538% compared to other engines for analytical calculations. We also show that the performance of our methods improves linearly with the provided computing resources (scale up and out).
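The abstract does not spell out the semantics of MINUS or of the tiled execution, so the following is only a minimal Python sketch of the kind of computation it describes, not FDQ's implementation. It assumes that each grid cell holds a series of accumulated readings ordered along one dimension (for example, time), and that a reading lower than its predecessor marks a data reset. The names `minus_with_resets`, `split_into_tiles`, and `process_tiled` are hypothetical and chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def minus_with_resets(series):
    """Turn accumulated readings into per-step amounts.

    Assumption (not taken from the paper): a reading lower than its
    predecessor marks a reset, after which the reading itself is the
    amount accumulated since that reset.
    """
    out = []
    prev = None
    for value in series:
        if prev is None or value < prev:
            out.append(value)         # first step, or the accumulator was reset
        else:
            out.append(value - prev)  # ordinary difference between neighbours
        prev = value
    return out


def split_into_tiles(n_cells, tile_size):
    """Yield (start, stop) index ranges that together cover n_cells cells."""
    for start in range(0, n_cells, tile_size):
        yield start, min(start + tile_size, n_cells)


def process_tiled(grid, tile_size=2, workers=2):
    """Apply minus_with_resets to every cell, one tile per task.

    grid is a list of per-cell series. Cells do not depend on each other,
    so tiles can run in parallel and be concatenated in their original order.
    """
    def run_tile(bounds):
        start, stop = bounds
        return [minus_with_resets(cell) for cell in grid[start:stop]]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        tiles = list(pool.map(run_tile, split_into_tiles(len(grid), tile_size)))
    return [cell for tile in tiles for cell in tile]
```

Under these assumptions, `minus_with_resets([1, 3, 6, 2, 5])` returns `[1, 2, 3, 2, 3]`: the drop from 6 to 2 is treated as a reset rather than as a negative difference. A thread pool is used here only to show the tile-level decomposition; a distributed engine would assign tiles to separate processes or nodes.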

