Will They Blend?: Exploring Big Data Computation Atop Traditional HPC NAS Storage

E. Wilson, M. Kandemir, Garth A. Gibson
{"title":"Will They Blend?: Exploring Big Data Computation Atop Traditional HPC NAS Storage","authors":"E. Wilson, M. Kandemir, Garth A. Gibson","doi":"10.1109/ICDCS.2014.60","DOIUrl":null,"url":null,"abstract":"The Apache Hadoop framework has rung in a new era in how data-rich organizations can process, store, and analyze large amounts of data. This has resulted in increased potential for an infrastructure exodus from the traditional solution of commercial database ad-hoc analytics on network-attached storage (NAS). While many data-rich organizations can afford to either move entirely to Hadoop for their Big Data analytics, or to maintain their existing traditional infrastructures and acquire a new set of infrastructure solely for Hadoop jobs, most supercomputing centers do not enjoy either of those possibilities. Too much of the existing scientific code is tailored to work on massively parallel file systems unlike the Hadoop Distributed File System (HDFS), and their datasets are too large to reasonably maintain and/or ferry between two distinct storage systems. Nevertheless, as scientists search for easier-to-program frameworks with a lower time-to-science to post-process their huge datasets after execution, there is increasing pressure to enable use of MapReduce within these traditional High Performance Computing (HPC) architectures. Therefore, in this work we explore potential means to enable use of the easy-to-program Hadoop MapReduce framework without requiring a complete infrastructure overhaul from existing HPC NAS solutions. We demonstrate that retaining function-dedicated resources like NAS is not only possible, but can even be effected efficiently with MapReduce. In our exploration, we unearth subtle pitfalls resultant from this mash-up of new-era Big Data computation on conventional HPC storage and share the clever architectural configurations that allow us to avoid them. Last, we design and present a novel Hadoop File System, the Reliable Array of Independent NAS File System (RainFS), and experimentally demonstrate its improvements in performance and reliability over the previous architectures we have investigated.","PeriodicalId":170186,"journal":{"name":"2014 IEEE 34th International Conference on Distributed Computing Systems","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 34th International Conference on Distributed Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS.2014.60","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

The Apache Hadoop framework has ushered in a new era in how data-rich organizations process, store, and analyze large amounts of data, increasing the potential for an infrastructure exodus from the traditional solution of commercial-database ad-hoc analytics on network-attached storage (NAS). While many data-rich organizations can afford either to move entirely to Hadoop for their Big Data analytics or to maintain their existing traditional infrastructure and acquire a separate one solely for Hadoop jobs, most supercomputing centers enjoy neither possibility. Too much existing scientific code is tailored to massively parallel file systems unlike the Hadoop Distributed File System (HDFS), and their datasets are too large to reasonably maintain in, or ferry between, two distinct storage systems. Nevertheless, as scientists search for easier-to-program frameworks with a lower time-to-science for post-processing their huge datasets, pressure is mounting to enable MapReduce within these traditional High Performance Computing (HPC) architectures. Therefore, in this work we explore potential means of enabling the easy-to-program Hadoop MapReduce framework without requiring a complete overhaul of existing HPC NAS infrastructure. We demonstrate that retaining function-dedicated resources like NAS is not only possible but can even be done efficiently with MapReduce. In our exploration, we unearth subtle pitfalls resulting from this mash-up of new-era Big Data computation and conventional HPC storage, and share the architectural configurations that allow us to avoid them. Finally, we design and present a novel Hadoop File System, the Reliable Array of Independent NAS File System (RainFS), and experimentally demonstrate its improvements in performance and reliability over the previously investigated architectures.
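The mechanism the abstract alludes to, a Hadoop File System that targets NAS instead of HDFS, would plug into Hadoop through its pluggable FileSystem abstraction. Below is a minimal, hypothetical sketch of how such a file system could be wired in. This is not the authors' RainFS implementation: the class name RainFileSystem, the rainfs:// scheme, and the assumption that every compute node mounts the NAS shares at identical paths are illustrative only.

```java
// Hypothetical sketch only: NOT the RainFS implementation from the paper.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class RainFileSystem extends RawLocalFileSystem {

  // Assuming each NAS share is mounted at the same path on every node, the
  // local file-system plumbing inherited from RawLocalFileSystem can perform
  // the actual reads and writes; only the Hadoop-facing metadata changes.
  @Override
  public URI getUri() {
    return URI.create("rainfs:///");
  }

  // HDFS reports which DataNodes hold each block so the JobTracker can place
  // map tasks near the data. A NAS has no such locality: every node reaches
  // the data over the network at equal cost. Advertising a single synthetic
  // location tells the scheduler that any node is equally good, instead of
  // letting it chase locality that does not exist.
  @Override
  public BlockLocation[] getFileBlockLocations(FileStatus file, long start,
                                               long len) throws IOException {
    if (file == null) {
      return null; // mirrors FileSystem's contract for a null status
    }
    String[] host = { "localhost" };
    return new BlockLocation[] { new BlockLocation(host, host, start, len) };
  }
}
```

In a Hadoop 1.x-era deployment (contemporary with the paper), such a class would be activated by setting fs.rainfs.impl to its fully qualified class name and pointing fs.default.name at rainfs:/// in core-site.xml, after which MapReduce jobs read and write through the NAS transparently.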