Cellular automata beyond 100k cores: MPI vs Fortran coarrays

A. Shterenlikht, L. Cebamanos
{"title":"Cellular automata beyond 100k cores: MPI vs Fortran coarrays","authors":"A. Shterenlikht, L. Cebamanos","doi":"10.1145/3236367.3236384","DOIUrl":null,"url":null,"abstract":"Fortran coarrays are an attractive alternative to MPI due to a familiar Fortran syntax, single sided communications and implementation in the compiler. Scaling of coarrays is compared in this work to MPI, using cellular automata (CA) 3D Ising magnetisation miniapps, built with the CASUP CA library, https://cgpack.sourceforge.io, developed by the authors. Ising energy and magnetisation were calculated with MPI_ALLREDUCE and Fortran 2018 co_sum collectives. The work was done on ARCHER (Cray XC30) up to the full machine capacity: 109,056 cores. Ping-pong latency and bandwidth results are very similar with MPI and with coarrays for message sizes from 1B to several MB. MPI halo exchange (HX) scaled better than coarray HX, which is surprising because both algorithms use pair-wise communications: MPI IRECV/ISEND/WAITALL vs Fortran sync images. Adding OpenMP to MPI or to coarrays resulted in worse L2 cache hit ratio, and lower performance in all cases, even though the NUMA effects were ruled out. This is likely because the CA algorithm is memory and network bound. The sampling and tracing analysis shows good load balancing in compute in all miniapps, but imbalance in communication, indicating that the difference in performance between MPI and coarrays is likely due to parallel libraries (MPICH2 vs libpgas) and the Cray hardware specific libraries (uGNI vs DMAPP). Overall, the results look promising for coarray use beyond 100k cores. However, further coarray optimisation is needed to narrow the performance gap between coarrays and MPI.","PeriodicalId":225539,"journal":{"name":"Proceedings of the 25th European MPI Users' Group Meeting","volume":"22 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th European MPI Users' Group Meeting","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3236367.3236384","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Fortran coarrays are an attractive alternative to MPI due to their familiar Fortran syntax, single-sided communications and implementation in the compiler. In this work, the scaling of coarrays is compared with that of MPI, using cellular automata (CA) 3D Ising magnetisation miniapps built with the CASUP CA library (https://cgpack.sourceforge.io), developed by the authors. Ising energy and magnetisation were calculated with MPI_ALLREDUCE and Fortran 2018 co_sum collectives. The work was done on ARCHER (Cray XC30) up to the full machine capacity of 109,056 cores. Ping-pong latency and bandwidth results are very similar with MPI and with coarrays for message sizes from 1 B to several MB. MPI halo exchange (HX) scaled better than coarray HX, which is surprising because both algorithms use pair-wise communications: MPI IRECV/ISEND/WAITALL vs Fortran sync images. Adding OpenMP to MPI or to coarrays resulted in a worse L2 cache hit ratio and lower performance in all cases, even though NUMA effects were ruled out. This is likely because the CA algorithm is memory and network bound. Sampling and tracing analysis shows good load balancing in compute in all miniapps, but imbalance in communication, indicating that the difference in performance between MPI and coarrays is likely due to the parallel libraries (MPICH2 vs libpgas) and the Cray hardware-specific libraries (uGNI vs DMAPP). Overall, the results look promising for coarray use beyond 100k cores. However, further coarray optimisation is needed to narrow the performance gap between coarrays and MPI.
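
The two reduction styles named in the abstract can be illustrated side by side. The sketch below is a minimal, hedged example, not code from the CASUP library: the variable names (local_e, total_mpi, total_caf) are invented for illustration, and it assumes a compiler and runtime, such as Cray's, that allow MPI and coarrays in the same executable.

```fortran
program ising_reduce
! Hedged sketch of the two reduction styles the abstract contrasts for
! the Ising energy and magnetisation: MPI_ALLREDUCE vs the Fortran 2018
! collective CO_SUM. Variable names are illustrative, not CASUP's.
! Assumes a runtime (e.g. Cray) that permits MPI + coarrays together.
  use mpi
  implicit none
  double precision :: local_e, total_mpi, total_caf
  integer :: ierr

  call MPI_INIT( ierr )

  local_e = dble( this_image() )   ! stand-in for a per-image CA sum

  ! MPI: two-sided collective, result delivered on every rank
  call MPI_ALLREDUCE( local_e, total_mpi, 1, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr )

  ! Fortran 2018: CO_SUM reduces in place across all images
  total_caf = local_e
  call co_sum( total_caf )

  if ( this_image() == 1 ) print *, total_mpi, total_caf

  call MPI_FINALIZE( ierr )
end program ising_reduce
```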
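Similarly, the two pair-wise halo-exchange (HX) patterns compared in the abstract can be sketched on a 1D decomposition standing in for the 3D CA grid. Again this is a hedged illustration, not CASUP code: the array space, the neighbour variables and the mapping rank = this_image() - 1 are assumptions (the rank mapping holds on Cray but is not guaranteed by the standard), and at least three images are needed so that the left and right neighbours are distinct.

```fortran
program hx_sketch
! Hedged sketch of the two pair-wise HX patterns: MPI IRECV/ISEND/
! WAITALL vs Fortran sync images with one-sided gets. Run on >= 3
! images so left /= right. space(0) and space(n+1) are the halo cells.
  use mpi
  implicit none
  integer, parameter :: n = 8
  integer :: space(0:n+1)[*]
  integer :: me, np, left, right, ierr
  integer :: req(4), stats(MPI_STATUS_SIZE,4)

  call MPI_INIT( ierr )
  me = this_image();  np = num_images()
  left  = merge( np, me-1, me == 1 )   ! periodic boundaries
  right = merge( 1,  me+1, me == np )
  space(1:n) = me                      ! fill this image's own cells

  ! --- MPI HX: post receives and sends, then wait on all 4 requests.
  !     Assumes the usual Cray mapping rank = this_image() - 1.
  call MPI_IRECV( space(0),   1, MPI_INTEGER, left-1,  0, MPI_COMM_WORLD, req(1), ierr )
  call MPI_IRECV( space(n+1), 1, MPI_INTEGER, right-1, 1, MPI_COMM_WORLD, req(2), ierr )
  call MPI_ISEND( space(n),   1, MPI_INTEGER, right-1, 0, MPI_COMM_WORLD, req(3), ierr )
  call MPI_ISEND( space(1),   1, MPI_INTEGER, left-1,  1, MPI_COMM_WORLD, req(4), ierr )
  call MPI_WAITALL( 4, req, stats, ierr )

  ! --- Coarray HX: pair-wise sync, one-sided gets of the neighbours'
  !     boundary cells into this image's halos, pair-wise sync again
  !     so neighbours cannot overwrite their cells before we read.
  sync images( [left, right] )
  space(0)   = space(n)[left]
  space(n+1) = space(1)[right]
  sync images( [left, right] )

  if ( me == 1 ) print *, 'halos on image 1:', space(0), space(n+1)
  call MPI_FINALIZE( ierr )
end program hx_sketch
```

Note that the coarray version synchronises only with its two neighbours via sync images rather than with a global barrier, which is what makes it pair-wise and directly comparable to the point-to-point MPI version.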