TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes

Carl Pearson, Kun Wu, I. Chung, Jinjun Xiong, Wen-mei W. Hwu
{"title":"TEMPI:一个具有cuda感知数据类型规范化表示的插入式MPI库","authors":"Carl Pearson, Kun Wu, I. Chung, Jinjun Xiong, Wen-mei W. Hwu","doi":"10.1145/3431379.3460645","DOIUrl":null,"url":null,"abstract":"MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. This work first presents a novel datatype handling strategy for nested strided datatypes, which finds a middle ground between the specialized or generic handling in prior work. This work also shows that the performance characteristics of non-contiguous data handling can be modeled with empirical system measurements, and used to transparently improve MPI_Send/Recv latency. Finally, despite substantial attention to non-contiguous GPU data and CUDA-aware MPI implementations, good performance cannot be taken for granted. This work demonstrates its contributions through an MPI interposer library, TEMPI. TEMPI can be used with existing MPI deployments without system or application changes. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242000x and MPI_Send speedup of up to 59000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 917x in a 3D halo exchange with 3072 processes.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"199 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes\",\"authors\":\"Carl Pearson, Kun Wu, I. Chung, Jinjun Xiong, Wen-mei W. Hwu\",\"doi\":\"10.1145/3431379.3460645\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. This work first presents a novel datatype handling strategy for nested strided datatypes, which finds a middle ground between the specialized or generic handling in prior work. This work also shows that the performance characteristics of non-contiguous data handling can be modeled with empirical system measurements, and used to transparently improve MPI_Send/Recv latency. Finally, despite substantial attention to non-contiguous GPU data and CUDA-aware MPI implementations, good performance cannot be taken for granted. This work demonstrates its contributions through an MPI interposer library, TEMPI. TEMPI can be used with existing MPI deployments without system or application changes. 
Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242000x and MPI_Send speedup of up to 59000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 917x in a 3D halo exchange with 3072 processes.\",\"PeriodicalId\":343991,\"journal\":{\"name\":\"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing\",\"volume\":\"199 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3431379.3460645\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3431379.3460645","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4

Abstract

MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to directly operate on GPU buffers, easing integration of GPU compute into MPI codes. This work first presents a novel datatype handling strategy for nested strided datatypes, which finds a middle ground between the specialized or generic handling in prior work. This work also shows that the performance characteristics of non-contiguous data handling can be modeled with empirical system measurements, and used to transparently improve MPI_Send/Recv latency. Finally, despite substantial attention to non-contiguous GPU data and CUDA-aware MPI implementations, good performance cannot be taken for granted. This work demonstrates its contributions through an MPI interposer library, TEMPI. TEMPI can be used with existing MPI deployments without system or application changes. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242000x and MPI_Send speedup of up to 59000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 917x in a 3D halo exchange with 3072 processes.
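As a concrete illustration of the derived datatypes the abstract refers to, the sketch below builds a nested strided datatype for a slab of a 3D array of doubles by composing MPI_Type_vector and MPI_Type_create_hvector on top of the primitive MPI_DOUBLE named type, then packs it with MPI_Pack. The grid extents and slab shape are illustrative assumptions, not values taken from the paper.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int nx = 64, ny = 64, nz = 64;  /* full grid extent (illustrative) */
    const int hx = 4,  hy = 64, hz = 64;  /* non-contiguous slab: hx x hy x hz */

    /* Inner level: hy rows of hx contiguous doubles, strided by nx doubles. */
    MPI_Datatype plane;
    MPI_Type_vector(hy, hx, nx, MPI_DOUBLE, &plane);

    /* Outer level: hz such planes, strided by one full nx*ny plane (in bytes). */
    MPI_Datatype slab;
    MPI_Type_create_hvector(hz, 1, (MPI_Aint)nx * ny * sizeof(double), plane, &slab);
    MPI_Type_commit(&slab);

    /* Pack the non-contiguous slab into a contiguous staging buffer. */
    double *grid = calloc((size_t)nx * ny * nz, sizeof(double));
    int packed_bytes = 0, position = 0;
    MPI_Pack_size(1, slab, MPI_COMM_WORLD, &packed_bytes);
    char *staging = malloc(packed_bytes);
    MPI_Pack(grid, 1, slab, staging, packed_bytes, &position, MPI_COMM_WORLD);
    printf("packed %d of %d bytes\n", position, packed_bytes);

    free(staging);
    free(grid);
    MPI_Type_free(&slab);
    MPI_Type_free(&plane);
    MPI_Finalize();
    return 0;
}
```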
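The abstract's point that CUDA-aware implementations let MPI functions operate directly on GPU buffers can be sketched as follows, assuming the slab datatype committed above and a two-rank communicator. The device pointer returned by cudaMalloc is handed straight to MPI_Send/MPI_Recv; the CUDA-aware MPI library (or an interposer such as TEMPI sitting in front of it) is then responsible for the non-contiguous device data. This is a generic usage sketch, not code from the paper.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Exchange one non-contiguous slab directly between GPU buffers.
   Assumes `slab` was built and committed as in the previous sketch. */
void exchange_slab(MPI_Datatype slab, size_t grid_bytes, int rank) {
    double *d_grid = NULL;
    cudaMalloc((void **)&d_grid, grid_bytes);  /* GPU-resident grid */

    if (rank == 0) {
        /* The device pointer is passed directly; no explicit cudaMemcpy. */
        MPI_Send(d_grid, 1, slab, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_grid, 1, slab, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_grid);
}
```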
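Finally, the "interposed library, no system or application changes" model can be illustrated with the standard MPI profiling interface (PMPI): a shared library that defines MPI_Send itself can perform its own datatype handling and then forward to the underlying implementation via PMPI_Send. This is a minimal sketch of the general interposition pattern only; it does not claim to reproduce TEMPI's internals.

```c
#include <mpi.h>

/* Interposed MPI_Send: the application links (or LD_PRELOADs) this library
   ahead of the MPI implementation and calls MPI_Send as usual. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    /* A real interposer would inspect `datatype` and `buf` here, e.g. detect
       a non-contiguous, GPU-resident datatype and pack it into a contiguous
       device buffer before forwarding. This sketch only forwards. */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```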