Big-Data in Climate Change Models — A Novel Approach with Hadoop MapReduce

2017 International Conference on High Performance Computing & Simulation (HPCS) Pub Date: 2017-07-01 DOI: 10.1109/HPCS.2017.17

J. C. Loaiza, G. Giuliani, G. Fiameni

Citations: 2

Abstract

The goal of this work is to present a software package that processes binary climate data by spawning Map-Reduce tasks, while introducing minimal computational overhead and without modifying existing application code. The package combines two tools: Pipistrello, a Java utility that allows users to execute Map-Reduce tasks over any kind of binary file, and Tina, a lightweight Python library that builds on top of Pipistrello to process scientific datasets, including NetCDF files. We benchmarked the combination of these two tools using a test Apache Hadoop cluster (4 nodes) and a “relatively” small data set (200 GB), obtaining encouraging results. With larger clusters and more storage space, Tina and Pipistrello should be able to scale up and analyse hundreds of terabytes of scientific data faster, more easily and more efficiently.
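
The Map-Reduce pattern over binary data that the abstract refers to can be illustrated with a minimal, single-process Python sketch. This is not the Pipistrello or Tina API, which the abstract does not describe. The record layout (little-endian float32), the function names and the split size are all assumptions chosen for illustration. The sketch cuts a binary blob into splits aligned on record boundaries, maps each split to a partial (sum, count), and reduces the partials to a mean.

```python
import struct
from functools import reduce

# Assumption: the binary data is a flat sequence of little-endian float32 values.
RECORD = struct.Struct("<f")


def split_records(blob, records_per_split):
    """Cut a binary blob into splits aligned on record boundaries."""
    step = RECORD.size * records_per_split
    return [blob[i:i + step] for i in range(0, len(blob), step)]


def map_split(split):
    """Map step: decode one split and emit a partial (sum, count)."""
    values = [v for (v,) in RECORD.iter_unpack(split)]
    return (sum(values), len(values))


def reduce_partials(a, b):
    """Reduce step: merge two partial (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def mean_of_blob(blob, records_per_split=4):
    """Run the map and reduce steps in-process and return the mean value."""
    partials = map(map_split, split_records(blob, records_per_split))
    total, count = reduce(reduce_partials, partials, (0.0, 0))
    return total / count if count else float("nan")
```

On a real Hadoop cluster the splits would be distributed across nodes and the map calls would run in parallel. Here they run sequentially, only to show why splits must respect record boundaries when the input is binary rather than line-oriented text.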

