MARIANE: MApReduce Implementation Adapted for HPC Environments

2011 IEEE/ACM 12th International Conference on Grid Computing. Publication date: 2011-09-21. DOI: 10.1109/Grid.2011.20

Zacharia Fadika, Elif Dede, M. Govindaraju, L. Ramakrishnan


Citations: 45

Abstract

MapReduce is becoming an increasingly popular framework and a potent programming model. The most popular open-source implementation of MapReduce, Hadoop, is based on the Hadoop Distributed File System (HDFS). However, because HDFS is not POSIX compliant, it cannot be fully leveraged by applications running on the majority of existing HPC environments, such as TeraGrid and NERSC. These HPC environments typically support globally shared file systems such as NFS and GPFS. On such resourceful HPC infrastructures, using Hadoop not only creates compatibility issues but also degrades overall performance because of the added overhead of HDFS. This paper presents a MapReduce implementation directly suitable for HPC environments and exposes the design choices that yield better performance in those settings. By leveraging the inherent functions of distributed file systems and abstracting them away from its MapReduce framework, MARIANE (MApReduce Implementation Adapted for HPC Environments) allows the model to be used in an expanding number of HPC environments and performs better in such settings. This paper shows the applicability and high performance of the MapReduce paradigm through MARIANE, an implementation designed for clustered and shared-disk file systems and, as such, not dedicated to a specific MapReduce solution. The paper identifies the components and trade-offs necessary for this model, and quantifies the performance gains exhibited by our approach over Apache Hadoop in distributed environments in a data-intensive setting, on the Magellan test bed at the National Energy Research Scientific Computing Center (NERSC).
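The abstract does not describe MARIANE's internals, so the following is only a minimal sketch of the general idea it names: running MapReduce on top of a shared POSIX file system rather than HDFS. It assumes every worker can see the same directory (as it would on an NFS or GPFS mount) and uses ordinary file I/O for both input and intermediate data. The function names, the file-naming scheme, and the use of threads as stand-in workers are all hypothetical and are not taken from the paper.

```python
import json
import os
import tempfile
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(task_id, input_path, shared_dir, num_reducers):
    """Count words in one input file; write one partition file per reducer."""
    partitions = [Counter() for _ in range(num_reducers)]
    # Plain POSIX read: no HDFS client, the file is just a path on the shared mount.
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                # crc32 gives a partition choice that is stable across processes.
                r = zlib.crc32(word.encode("utf-8")) % num_reducers
                partitions[r][word] += 1
    # Intermediate data goes back to the shared directory (hypothetical naming).
    for r, counts in enumerate(partitions):
        out_path = os.path.join(shared_dir, f"map-{task_id}-part-{r}.json")
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(counts, f)


def reduce_task(reducer_id, num_maps, shared_dir):
    """Sum the counts of every map output belonging to this reducer's partition."""
    total = Counter()
    for task_id in range(num_maps):
        path = os.path.join(shared_dir, f"map-{task_id}-part-{reducer_id}.json")
        with open(path, encoding="utf-8") as f:
            total.update(json.load(f))
    return dict(total)


def run_word_count(input_paths, shared_dir, num_reducers=2, num_workers=4):
    """Run all map tasks, then all reduce tasks; threads stand in for cluster nodes."""
    jobs = list(enumerate(input_paths))
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # list(...) forces every map task to finish before any reduce task starts.
        list(pool.map(lambda j: map_task(j[0], j[1], shared_dir, num_reducers), jobs))
        partials = list(
            pool.map(lambda r: reduce_task(r, len(jobs), shared_dir), range(num_reducers))
        )
    result = {}
    for partial in partials:
        result.update(partial)  # partitions are disjoint, so keys never collide
    return result


if __name__ == "__main__":
    # A temporary directory stands in for the shared NFS/GPFS mount.
    with tempfile.TemporaryDirectory() as shared:
        sample = os.path.join(shared, "input-0.txt")
        with open(sample, "w", encoding="utf-8") as f:
            f.write("map reduce map\n")
        print(run_word_count([sample], shared))
```

In this sketch the file system is the only channel between the map and reduce stages, which is the property a globally shared mount provides for free; how MARIANE itself splits input, schedules tasks, or handles faults is not stated in the abstract.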
