Akshay Venkatesh, S. Potluri, R. Rajachandrasekar, Miao Luo, Khaled Hamidouche, D. Panda
{"title":"High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters","authors":"Akshay Venkatesh, S. Potluri, R. Rajachandrasekar, Miao Luo, Khaled Hamidouche, D. Panda","doi":"10.1109/IPDPS.2014.72","DOIUrl":null,"url":null,"abstract":"Intel's Many-Integrated-Core (MIC) architecture aims to provide Teraflop throughput (through high degrees of parallelism) with a high FLOP/Watt ratio and x86 compatibility. However, this two-fold approach to solving power and programmability challenges for Exascale computing is constrained by certain architectural idiosyncrasies. MIC coprocessors have a memory constrained environment and its processors operate at slower clock rates. Also, being PCI devices, the communication characteristics of MIC co-processors are different compared to communication behavior seen in homogeneous environments. For instance, the performance of sending data from the MIC memory to a remote node's memory through message passing routines has 3x-6x higher latency than sending from the host processor memory. Hence communication libraries that do not consider these architectural subtleties are likely to nullify performance benefits or even cause degradation in applications that intend to use MICs and rely heavily on communication routines. The performance of Message Passing Interface (MPI) operations, especially dense collective operations like All-to-all and All gather, strongly affect the performance of many distributed parallel applications. In this paper, we revisit state-of-the-art algorithms commonly used to implement All-to-all collectives and propose adaptations and optimizations to alleviate architectural bottlenecks on MIC clusters. We also propose a few novel designs to improve the communication latency of these operations. Through micro-benchmarks and applications, we substantiate the benefits of incorporating the proposed adaptations to the All-to-All collective operations. At the micro-benchmark level, the proposed designs show as much as 79% improvement for All gather operation and up to 70% improvement for All-to-all and with the P3DFFT application, an improvement of 38% is seen in overall execution time.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2014.72","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Intel's Many-Integrated-Core (MIC) architecture aims to provide Teraflop throughput (through high degrees of parallelism) with a high FLOP/Watt ratio and x86 compatibility. However, this two-fold approach to solving the power and programmability challenges of Exascale computing is constrained by certain architectural idiosyncrasies. MIC coprocessors operate in a memory-constrained environment, and their cores run at slower clock rates. Also, because MIC coprocessors are PCI devices, their communication characteristics differ from the communication behavior seen in homogeneous environments. For instance, sending data from MIC memory to a remote node's memory through message passing routines incurs 3x-6x higher latency than sending from host processor memory. Hence, communication libraries that do not account for these architectural subtleties are likely to nullify performance benefits, or even cause degradation, in applications that use MICs and rely heavily on communication routines. The performance of Message Passing Interface (MPI) operations, especially dense collective operations like Alltoall and Allgather, strongly affects the performance of many distributed parallel applications. In this paper, we revisit state-of-the-art algorithms commonly used to implement Alltoall collectives and propose adaptations and optimizations to alleviate architectural bottlenecks on MIC clusters. We also propose a few novel designs to improve the communication latency of these operations. Through micro-benchmarks and applications, we substantiate the benefits of incorporating the proposed adaptations into the Alltoall collective operations. At the micro-benchmark level, the proposed designs show as much as 79% improvement for the Allgather operation and up to 70% improvement for Alltoall; with the P3DFFT application, a 38% improvement in overall execution time is observed.
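The abstract itself contains no code; as a rough illustration of the kind of micro-benchmark used to measure dense collective latency (e.g., to expose the host-vs-MIC buffer gap the authors describe), the sketch below times MPI_Alltoall in C. The message size, iteration counts, and output format are illustrative assumptions, not taken from the paper.

```c
/*
 * Illustrative sketch only (not the paper's benchmark): a minimal
 * micro-benchmark that measures average MPI_Alltoall latency. The
 * per-peer message size and iteration counts are arbitrary choices.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int msg_bytes = 4096;           /* per-peer message size (assumed) */
    const int warmup = 10, iters = 100;   /* iteration counts (assumed)      */

    /* In Alltoall, each rank sends msg_bytes to every rank, including itself. */
    char *sendbuf = calloc((size_t)size, msg_bytes);
    char *recvbuf = calloc((size_t)size, msg_bytes);

    /* Warm-up iterations let connection setup and caching effects settle. */
    for (int i = 0; i < warmup; i++)
        MPI_Alltoall(sendbuf, msg_bytes, MPI_CHAR,
                     recvbuf, msg_bytes, MPI_CHAR, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, msg_bytes, MPI_CHAR,
                     recvbuf, msg_bytes, MPI_CHAR, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - t0) / iters;

    /* Report the slowest rank's average, since a collective completes
       only when every participating rank has finished. */
    double worst;
    MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Alltoall avg latency: %.2f us\n", worst * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Run with some ranks placed on host processors and others on MIC coprocessors (placement depends on the MPI launcher and cluster setup), a benchmark of this shape would surface the 3x-6x latency difference between MIC-resident and host-resident buffers that motivates the paper's redesigned collectives.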