A Feasibility Study for MPI over HDFS
Wu-chun Feng, Da Zhang, Jing Zhang, Kaixi Hou, S. Pumma, Hao Wang
{"title":"基于HDFS的MPI的可行性研究","authors":"Wu-chun Feng, Da Zhang, Jing Zhang, Kaixi Hou, S. Pumma, Hao Wang","doi":"10.1109/HPEC43674.2020.9286250","DOIUrl":null,"url":null,"abstract":"With the increasing prominence of integrating highperformance computing (HPC) with big-data (BIGDATA) processing, running MPI over the Hadoop Distributed File System (HDFS) offers a promising approach for delivering better scalability and fault tolerance to traditional HPC applications. However, it comes with challenges that discourage such an approach: (1) two-sided MPI communication to support intermediate data processing, (2) a focus on enabling N-1 writes that is subject to the default HDFS block-placement policy, and (3) a pipelined writing mode in HDFS that cannot fully utilize the underlying HPC hardware. So, while directly integrating MPI with HDFS may deliver better scalability and fault tolerance to MPI applications, it will fall short of delivering competitive performance. Consequently, we present a performance study to evaluate the feasibility of integrating MPI applications to run over HDFS. Specifically, we show that by aggregating and reordering intermediate data and coordinating computation and 110 when running MPI over HDFS, we can deliver up to 1.92x and 1.78x speedup over MPI I/O and HDFS pipelined-write implementations, respectively. Consequently, we present a performance study to evaluate the feasibility of integrating MPI applications to run over HDFS. Specifically, we show that by aggregating and reordering intermediate data and coordinating computation and 110 when running MPI over HDFS, we can deliver up to 1.92x and 1.78x speedup over MPI I/O and HDFS pipelined-write implementations, respectively.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Feasibility Study for MPI over HDFS\",\"authors\":\"Wu-chun Feng, Da Zhang, Jing Zhang, Kaixi Hou, S. Pumma, Hao Wang\",\"doi\":\"10.1109/HPEC43674.2020.9286250\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the increasing prominence of integrating highperformance computing (HPC) with big-data (BIGDATA) processing, running MPI over the Hadoop Distributed File System (HDFS) offers a promising approach for delivering better scalability and fault tolerance to traditional HPC applications. However, it comes with challenges that discourage such an approach: (1) two-sided MPI communication to support intermediate data processing, (2) a focus on enabling N-1 writes that is subject to the default HDFS block-placement policy, and (3) a pipelined writing mode in HDFS that cannot fully utilize the underlying HPC hardware. So, while directly integrating MPI with HDFS may deliver better scalability and fault tolerance to MPI applications, it will fall short of delivering competitive performance. Consequently, we present a performance study to evaluate the feasibility of integrating MPI applications to run over HDFS. Specifically, we show that by aggregating and reordering intermediate data and coordinating computation and 110 when running MPI over HDFS, we can deliver up to 1.92x and 1.78x speedup over MPI I/O and HDFS pipelined-write implementations, respectively. 
Consequently, we present a performance study to evaluate the feasibility of integrating MPI applications to run over HDFS. Specifically, we show that by aggregating and reordering intermediate data and coordinating computation and 110 when running MPI over HDFS, we can deliver up to 1.92x and 1.78x speedup over MPI I/O and HDFS pipelined-write implementations, respectively.\",\"PeriodicalId\":168544,\"journal\":{\"name\":\"2020 IEEE High Performance Extreme Computing Conference (HPEC)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE High Performance Extreme Computing Conference (HPEC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPEC43674.2020.9286250\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC43674.2020.9286250","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
With the increasing prominence of integrating high-performance computing (HPC) with big-data processing, running MPI over the Hadoop Distributed File System (HDFS) offers a promising approach for delivering better scalability and fault tolerance to traditional HPC applications. However, it comes with challenges that discourage such an approach: (1) two-sided MPI communication to support intermediate data processing, (2) a focus on enabling N-1 writes that are subject to the default HDFS block-placement policy, and (3) a pipelined write mode in HDFS that cannot fully utilize the underlying HPC hardware. So, while directly integrating MPI with HDFS may deliver better scalability and fault tolerance to MPI applications, it will fall short of delivering competitive performance. Consequently, we present a performance study to evaluate the feasibility of running MPI applications over HDFS. Specifically, we show that by aggregating and reordering intermediate data, and by coordinating computation and I/O when running MPI over HDFS, we can deliver up to 1.92x and 1.78x speedup over MPI I/O and HDFS pipelined-write implementations, respectively.
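To make the "aggregate and reorder intermediate data" idea in the abstract concrete, the C/MPI sketch below is an illustration only, not the paper's implementation: it assumes fixed-size keyed records, a single aggregator rank, and MPI-IO as a stand-in for the HDFS write path; the record layout, key assignment, and file name are invented for the example. Each rank produces intermediate records, one rank gathers them, sorts them by key, and issues a single contiguous write instead of many small, interleaved ones.

/* Hypothetical sketch of aggregate-then-reorder before writing (not the
 * paper's code).  MPI-IO stands in for whatever file-system client is
 * actually used; a real design would also use multiple aggregators. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define RECS_PER_RANK 4

typedef struct {
    int  key;          /* e.g., destination block or reducer id (assumed) */
    char payload[60];  /* fixed-size payload keeps offsets trivial */
} record_t;

static int cmp_key(const void *a, const void *b)
{
    return ((const record_t *)a)->key - ((const record_t *)b)->key;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* 1. Produce intermediate records locally on every rank. */
    record_t local[RECS_PER_RANK];
    for (int i = 0; i < RECS_PER_RANK; i++) {
        local[i].key = (rank + i * 7) % nprocs;   /* toy key assignment */
        snprintf(local[i].payload, sizeof local[i].payload,
                 "from rank %d, item %d", rank, i);
    }

    /* 2. Aggregate: gather everyone's records on rank 0. */
    record_t *all = NULL;
    if (rank == 0)
        all = malloc((size_t)nprocs * RECS_PER_RANK * sizeof(record_t));
    MPI_Gather(local, (int)sizeof local, MPI_BYTE,
               all,   (int)sizeof local, MPI_BYTE, 0, MPI_COMM_WORLD);

    /* 3. Reorder by key, then write once, contiguously, from the aggregator. */
    if (rank == 0) {
        int n = nprocs * RECS_PER_RANK;
        qsort(all, (size_t)n, sizeof(record_t), cmp_key);

        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, "intermediate.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write(fh, all, n * (int)sizeof(record_t), MPI_BYTE,
                       MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(all);
        printf("aggregator wrote %zu bytes in key order\n",
               (size_t)n * sizeof(record_t));
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run under mpirun, the sketch turns many scattered per-rank writes into one key-ordered write from the aggregator; the paper's contribution is the analogous coordination of computation and I/O when the destination is HDFS rather than a POSIX/MPI-IO file.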