{"title":"Big-Data in Climate Change Models — A Novel Approach with Hadoop MapReduce","authors":"J. C. Loaiza, G. Giuliani, G. Fiameni","doi":"10.1109/HPCS.2017.17","DOIUrl":null,"url":null,"abstract":"The goal of this work is to present a software package which is able to process binary climate data through spawning Map-Reduce tasks while introducing minimum computational overhead and without modifying existing application code. The package is formed by the combination of two tools, Pipistrello, a Java utility that allows users to execute Map-Reduce tasks over any kind of binary file, Tina a lightweight Python library that building on top of Pipistrello is able to process scientific dataset, including NetCDF files. We benchmarked the combination of this two tools using a test Apache Hadoop Cluster (4 nodes) and a “relatively” small data set (200 GB), obtaining encouraging results. When using larger clusters and larger storage space, Tina and Pipistrello should be able to scale-up and analyse hundreds of Terabytes of scientific data in a faster, easier and efficient way.","PeriodicalId":115758,"journal":{"name":"2017 International Conference on High Performance Computing & Simulation (HPCS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCS.2017.17","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
The goal of this work is to present a software package which is able to process binary climate data through spawning Map-Reduce tasks while introducing minimum computational overhead and without modifying existing application code. The package is formed by the combination of two tools, Pipistrello, a Java utility that allows users to execute Map-Reduce tasks over any kind of binary file, Tina a lightweight Python library that building on top of Pipistrello is able to process scientific dataset, including NetCDF files. We benchmarked the combination of this two tools using a test Apache Hadoop Cluster (4 nodes) and a “relatively” small data set (200 GB), obtaining encouraging results. When using larger clusters and larger storage space, Tina and Pipistrello should be able to scale-up and analyse hundreds of Terabytes of scientific data in a faster, easier and efficient way.