Autotuning MPI Collectives using Performance Guidelines

Sascha Hunold, Alexandra Carpen-Amarie
{"title":"Autotuning MPI Collectives using Performance Guidelines","authors":"S. Hunold, Alexandra Carpen-Amarie","doi":"10.1145/3149457.3149461","DOIUrl":null,"url":null,"abstract":"MPI collective operations provide a standardized interface for performing data movements within a group of processes. The efficiency of collective communication operations depends on the actual algorithm, its implementation, and the specific communication problem (type of communication, message size, and number of processes). Many MPI libraries provide numerous algorithms for specific collective operations. The strategy for selecting an efficient algorithm is often times predefined (hard-coded) in MPI libraries, but some of them, such as Open MPI, allow users to change the algorithm manually. Finding the best algorithm for each case is a hard problem, and several approaches to tune these algorithmic parameters have been proposed. We use an orthogonal approach to the parameter-tuning of MPI collectives, that is, instead of testing individual algorithmic choices provided by an MPI library, we compare the latency of a specific MPI collective operation to the latency of semantically equivalent functions, which we call the mock-up implementations. The structure of the mock-up implementations is defined by self-consistent performance guidelines. The advantage of this approach is that tuning using mock-up implementations is always possible, whether or not an MPI library allows users to select a specific algorithm at run-time. We implement this concept in a library called PGMPITuneLib, which is layered between the user code and the actual MPI implementation. This library selects the best-performing algorithmic pattern of an MPI collective by intercepting MPI calls and redirecting them to our mock-up implementations. Experimental results show that PGMPITuneLib can significantly reduce the latency of MPI collectives, and also equally important, that it can help identifying the tuning potential of MPI libraries.","PeriodicalId":314778,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","volume":"1396 ","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3149457.3149461","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

MPI collective operations provide a standardized interface for performing data movements within a group of processes. The efficiency of collective communication operations depends on the actual algorithm, its implementation, and the specific communication problem (type of communication, message size, and number of processes). Many MPI libraries provide numerous algorithms for specific collective operations. The strategy for selecting an efficient algorithm is often predefined (hard-coded) in MPI libraries, but some of them, such as Open MPI, allow users to change the algorithm manually. Finding the best algorithm for each case is a hard problem, and several approaches to tuning these algorithmic parameters have been proposed. We take an approach that is orthogonal to the parameter-tuning of MPI collectives: instead of testing individual algorithmic choices provided by an MPI library, we compare the latency of a specific MPI collective operation to the latency of semantically equivalent functions, which we call mock-up implementations. The structure of the mock-up implementations is defined by self-consistent performance guidelines. The advantage of this approach is that tuning with mock-up implementations is always possible, whether or not an MPI library allows users to select a specific algorithm at run time. We implement this concept in a library called PGMPITuneLib, which is layered between the user code and the actual MPI implementation. This library selects the best-performing algorithmic pattern of an MPI collective by intercepting MPI calls and redirecting them to our mock-up implementations. Experimental results show that PGMPITuneLib can significantly reduce the latency of MPI collectives and, equally important, that it can help identify the tuning potential of MPI libraries.
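To make the idea of a guideline-derived mock-up concrete, here is a minimal sketch (not code from the paper) of a semantically equivalent replacement for MPI_Allgather. It follows the self-consistent performance guideline that an allgather should not be slower than a gather followed by a broadcast; the function name and structure are illustrative assumptions, not PGMPITuneLib's actual internals.

```c
#include <mpi.h>

/*
 * Illustrative mock-up of MPI_Allgather (not PGMPITuneLib's code):
 * the guideline "Allgather(n) should not be slower than Gather(n)
 * followed by Bcast(n)" implies that gathering every block at a root
 * and broadcasting the assembled buffer is semantically equivalent
 * to an allgather, so its latency can be compared to the library's.
 * MPI_IN_PLACE is not handled in this sketch.
 */
int mockup_allgather(const void *sendbuf, int sendcount,
                     MPI_Datatype sendtype, void *recvbuf,
                     int recvcount, MPI_Datatype recvtype,
                     MPI_Comm comm)
{
    int size, rc;
    MPI_Comm_size(comm, &size);

    /* Step 1: collect one block per process at root 0
       (the recv arguments are significant only at the root). */
    rc = PMPI_Gather(sendbuf, sendcount, sendtype,
                     recvbuf, recvcount, recvtype, 0, comm);
    if (rc != MPI_SUCCESS)
        return rc;

    /* Step 2: broadcast the assembled buffer to every process. */
    return PMPI_Bcast(recvbuf, size * recvcount, recvtype, 0, comm);
}
```

The mock-up calls the PMPI_ entry points rather than MPI_ so that, once the tuning layer is interposed, its own internal collectives are not intercepted again.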
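The "intercepting MPI calls" step is most naturally realized through the MPI profiling interface (PMPI); whether PGMPITuneLib uses exactly this mechanism, and how it decides between implementations, are assumptions here. In the hedged sketch below, the wrapper shadows MPI_Allgather, and use_mockup() is a hypothetical stand-in for the library's measurement-driven selection logic.

```c
#include <mpi.h>

/* Mock-up from the previous sketch (assumed to be linked in). */
int mockup_allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                     void *recvbuf, int recvcount, MPI_Datatype recvtype,
                     MPI_Comm comm);

/* Hypothetical selection predicate; a real tuning library would consult
 * per-(collective, message size, communicator size) measurements. */
static int use_mockup(int sendcount, MPI_Datatype sendtype, MPI_Comm comm)
{
    int nprocs;
    MPI_Aint lb, extent;
    MPI_Comm_size(comm, &nprocs);
    MPI_Type_get_extent(sendtype, &lb, &extent);
    /* Illustrative threshold only. */
    return (MPI_Aint)sendcount * extent >= 4096 && nprocs >= 16;
}

/* Because the tuning library defines MPI_Allgather, the linker resolves
 * the application's calls here; the MPI library's native algorithm
 * remains reachable through its PMPI_ alias. */
int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm)
{
    if (use_mockup(sendcount, sendtype, comm))
        return mockup_allgather(sendbuf, sendcount, sendtype,
                                recvbuf, recvcount, recvtype, comm);
    return PMPI_Allgather(sendbuf, sendcount, sendtype,
                          recvbuf, recvcount, recvtype, comm);
}
```

Compiled into a library that is linked (or preloaded) ahead of the MPI implementation, such a layer is transparent to the application, which is what makes the approach applicable even when the MPI library exposes no algorithm-selection knobs.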