DataMPI: Extending MPI to Hadoop-Like Big Data Computing

2014 IEEE 28th International Parallel and Distributed Processing Symposium. Pub Date: 2014-05-19. DOI: 10.1109/IPDPS.2014.90

Xiaoyi Lu, Fan Liang, Bing Wang, L. Zha, Zhiwei Xu


Citation count: 64

Abstract

MPI has been widely used in High Performance Computing. In contrast, such efficient communication support is lacking in the field of Big Data Computing, where communication is realized by time-consuming techniques such as HTTP/RPC. This paper takes a step toward bridging these two fields by extending MPI to support Hadoop-like Big Data Computing jobs, which must process and communicate large numbers of key-value pair instances through distributed computation models such as MapReduce, Iteration, and Streaming. We abstract the characteristics of key-value communication patterns into a bipartite communication model, which reveals four distinctions from MPI: Dichotomic, Dynamic, Data-centric, and Diversified features. Using this model, we propose the specification of a minimalistic extension to MPI. An open source communication library, DataMPI, is developed to implement this specification. Performance experiments show that DataMPI has significant advantages in performance and flexibility, while maintaining the high productivity, scalability, and fault tolerance of Hadoop.
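
As a rough illustration of the key-value communication pattern the abstract describes, the following Python sketch routes pairs from one group of tasks to a second group, the two sides of a bipartite graph. It is not DataMPI's API: every function name here is hypothetical, the exchange runs in a single process rather than over a network, and hash partitioning with `crc32` is an assumption chosen for the example, not something the abstract specifies.

```python
from collections import defaultdict
from zlib import crc32


def partition(key: str, num_receivers: int) -> int:
    """Pick a receiver index for a key using a deterministic hash (assumed scheme)."""
    return crc32(key.encode("utf-8")) % num_receivers


def bipartite_exchange(sender_outputs, num_receivers):
    """Route every (key, value) pair from the sender group to the receiver group.

    sender_outputs: one list of (key, value) pairs per sender task.
    Returns one dict per receiver task, mapping key -> list of values.
    """
    receivers = [defaultdict(list) for _ in range(num_receivers)]
    for pairs in sender_outputs:
        for key, value in pairs:
            # All values for a given key land on the same receiver.
            receivers[partition(key, num_receivers)][key].append(value)
    return [dict(r) for r in receivers]


def word_count(documents, num_receivers=2):
    """MapReduce-style word count built on the exchange above."""
    # Sender side: each document is handled by one task emitting (word, 1).
    sender_outputs = [[(word, 1) for word in doc.split()] for doc in documents]
    grouped = bipartite_exchange(sender_outputs, num_receivers)
    # Receiver side: each task sums the values for the keys it owns.
    return {key: sum(values) for part in grouped for key, values in part.items()}
```

Calling `word_count(["a b a", "b c"])` counts each word across both documents. Because the partition function depends only on the key, the receivers never need to coordinate with one another, which is what makes the pattern two-sided rather than all-to-all among peers.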
