Efficient Asynchronous Communication Progress for MPI without Dedicated Resources

Amit Ruhela, H. Subramoni, S. Chakraborty, Mohammadreza Bayatpour, Pouya Kousha, D. Panda
{"title":"无专用资源的MPI高效异步通信进程","authors":"Amit Ruhela, H. Subramoni, S. Chakraborty, Mohammadreza Bayatpour, Pouya Kousha, D. Panda","doi":"10.1145/3236367.3236376","DOIUrl":null,"url":null,"abstract":"The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for the asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores or application modification (e.g. use of MPI_Test). These techniques suffer from various issues like increasing code complexity/cost and loss of available compute resources for end applications. In this paper, we take up this challenge and propose a simple yet effective technique to achieve good overlap without needing any additional hardware or software resources. The proposed thread-based design allows MPI libraries to self-detect when asynchronous communication progress is needed and minimizes the number of context-switches and preemption between the main thread and the asynchronous progress thread. We evaluate the proposed design against state-of-the-art designs in other MPI libraries including MVAPICH2, Intel MPI, and Open MPI. We demonstrate benefits of the proposed approach at microbenchmark and at application level at scale on four different architectures including Intel Broadwell, Intel Xeon Phi (KNL), IBM OpenPOWER, and Intel Skylake with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our proposed approach shows upto 46%, 37%, and 49% improvement for All-to-one, One-to-all, and All-to-all communication patterns respectively collectives on 1,024 processes. We also show 38% performance improvement for SPEC MPI compute-intensive applications on 384 processes and 44% performance improvement with the P3DFFT application on 448 processes.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"Efficient Asynchronous Communication Progress for MPI without Dedicated Resources\",\"authors\":\"Amit Ruhela, H. Subramoni, S. Chakraborty, Mohammadreza Bayatpour, Pouya Kousha, D. Panda\",\"doi\":\"10.1145/3236367.3236376\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for the asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores or application modification (e.g. use of MPI_Test). These techniques suffer from various issues like increasing code complexity/cost and loss of available compute resources for end applications. In this paper, we take up this challenge and propose a simple yet effective technique to achieve good overlap without needing any additional hardware or software resources. The proposed thread-based design allows MPI libraries to self-detect when asynchronous communication progress is needed and minimizes the number of context-switches and preemption between the main thread and the asynchronous progress thread. We evaluate the proposed design against state-of-the-art designs in other MPI libraries including MVAPICH2, Intel MPI, and Open MPI. 
We demonstrate benefits of the proposed approach at microbenchmark and at application level at scale on four different architectures including Intel Broadwell, Intel Xeon Phi (KNL), IBM OpenPOWER, and Intel Skylake with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, our proposed approach shows upto 46%, 37%, and 49% improvement for All-to-one, One-to-all, and All-to-all communication patterns respectively collectives on 1,024 processes. We also show 38% performance improvement for SPEC MPI compute-intensive applications on 384 processes and 44% performance improvement with the P3DFFT application on 448 processes.\",\"PeriodicalId\":225539,\"journal\":{\"name\":\"Proceedings of the 25th European MPI Users' Group Meeting\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 25th European MPI Users' Group Meeting\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3236367.3236376\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th European MPI Users' Group Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3236367.3236376","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

The overlap of computation and communication is critical to the performance of many HPC applications. State-of-the-art designs for asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores, or application modification (e.g., the use of MPI_Test). These techniques suffer from issues such as increased code complexity and cost, and the loss of compute resources otherwise available to the application. In this paper, we take up this challenge and propose a simple yet effective technique that achieves good overlap without any additional hardware or software resources. The proposed thread-based design allows MPI libraries to self-detect when asynchronous communication progress is needed, and it minimizes context switches and preemption between the main thread and the asynchronous progress thread. We evaluate the proposed design against state-of-the-art designs in other MPI libraries, including MVAPICH2, Intel MPI, and Open MPI. We demonstrate the benefits of the proposed approach at the microbenchmark level and at the application level at scale on four architectures, Intel Broadwell, Intel Xeon Phi (KNL), IBM OpenPOWER, and Intel Skylake, with InfiniBand and Omni-Path interconnects. Compared to other state-of-the-art designs, the proposed approach shows up to 46%, 37%, and 49% improvement for All-to-one, One-to-all, and All-to-all collective communication patterns, respectively, on 1,024 processes. We also show a 38% performance improvement for compute-intensive SPEC MPI applications on 384 processes and a 44% performance improvement with the P3DFFT application on 448 processes.
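
The "application modification" drawback the abstract refers to is the need to sprinkle MPI_Test calls through compute loops so the MPI library has opportunities to progress an outstanding nonblocking operation. A minimal sketch of that pattern, using only standard MPI calls (the buffer sizes, loop count, and helper names are illustrative, not from the paper):

    /* Manual progress via MPI_Test: the application-modification
     * approach the abstract cites as a drawback of existing designs.
     * Compile with an MPI wrapper, e.g. mpicc -std=c99. */
    #include <mpi.h>
    #include <stddef.h>

    /* Illustrative compute kernel; stands in for real application work. */
    static void compute_chunk(double *buf, size_t n) {
        for (size_t i = 0; i < n; i++)
            buf[i] = buf[i] * 1.0001 + 1.0;
    }

    void overlap_with_manual_progress(double *send, double *work,
                                      size_t n, int peer) {
        MPI_Request req;
        int done = 0;

        /* Start a nonblocking send; without explicit progress calls,
         * many MPI libraries only advance it inside later MPI calls. */
        MPI_Isend(send, (int)n, MPI_DOUBLE, peer, /*tag=*/0,
                  MPI_COMM_WORLD, &req);

        /* Interleave compute with MPI_Test so the library can progress
         * the transfer -- the code-complexity cost the paper's
         * self-detecting design avoids. */
        for (int step = 0; step < 100 && !done; step++) {
            compute_chunk(work, n);
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
        if (!done)
            MPI_Wait(&req, MPI_STATUS_IGNORE);
    }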
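For contrast, the sketch below shows a generic user-level progress thread built from standard MPI and POSIX threads. It is not the paper's implementation, which lives inside the MPI library and self-detects when progress is needed; the fixed polling interval here is exactly the kind of blunt instrument whose context-switch and preemption overheads the proposed design minimizes.

    /* A minimal user-level progress-thread sketch (NOT the paper's
     * library-internal design). A helper thread polls the MPI library
     * so nonblocking operations advance while the main thread computes.
     * Requires MPI_THREAD_MULTIPLE; the 100-microsecond poll interval
     * is an assumed, illustrative knob. */
    #include <mpi.h>
    #include <pthread.h>
    #include <stdatomic.h>
    #include <unistd.h>

    static atomic_int progress_run = 1;

    static void *progress_loop(void *arg) {
        int flag;
        while (atomic_load(&progress_run)) {
            /* MPI_Iprobe is a cheap call that drives internal progress
             * in most MPI libraries. */
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, MPI_STATUS_IGNORE);
            usleep(100); /* sleep to limit preemption of the main thread */
        }
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            MPI_Abort(MPI_COMM_WORLD, 1);

        pthread_t tid;
        pthread_create(&tid, NULL, progress_loop, NULL);

        /* ... main thread posts nonblocking operations and computes;
         * the helper thread keeps communication progressing ... */

        atomic_store(&progress_run, 0);
        pthread_join(tid, NULL);
        MPI_Finalize();
        return 0;
    }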