{"title":"Exascale potholes for HPC: Execution performance and variability analysis of the flagship application code HemeLB","authors":"B. Wylie","doi":"10.1109/HUSTProtools51951.2020.00014","DOIUrl":null,"url":null,"abstract":"Performance measurement and analysis of parallel applications is often challenging, despite many excellent commercial and open-source tools being available. Currently envisaged exascale computer systems exacerbate matters by requiring extremely high scalability to effectively exploit millions of processor cores. Unfortunately, significant application execution performance variability arising from increasingly complex interactions between hardware and system software makes this situation much more difficult for application developers and performance analysts alike. This work considers the performance assessment of the HemeLB exascale flagship application code from the EU HPC Centre of Excellence (CoE) for Computational Biomedicine (CompBioMed) running on the SuperMUC-NG Tier-0 leadership system, using the methodology of the Performance Optimisation and Productivity (POP) CoE. Although 80% scaling efficiency is maintained to over 100,000 MPI processes, disappointing initial performance with more processes and corresponding poor strong scaling was identified to originate from the same few compute nodes in multiple runs, which later system diagnostic checks found had faulty DIMMs and lacklustre performance. Excluding these compute nodes from subsequent runs improved performance of executions with over 300,000 MPI processes by a factor of five, resulting in 190 x speed-up compared to 864 MPI processes. While communication efficiency remains very good up to the largest scale, parallel efficiency is primarily limited by load balance found to be largely due to core-to-core and run-to-run variability from excessive stalls for memory accesses, that affect many HPC systems with Intel Xeon Scalable processors. The POP methodology for this performance diagnosis is demonstrated via a detailed exposition with widely deployed ‘standard’ measurement and analysis tools.","PeriodicalId":38836,"journal":{"name":"Meta: Avaliacao","volume":"85 1","pages":"59-70"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Meta: Avaliacao","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HUSTProtools51951.2020.00014","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Social Sciences","Score":null,"Total":0}
Citations: 2
Abstract
Performance measurement and analysis of parallel applications is often challenging, despite the many excellent commercial and open-source tools available. Currently envisaged exascale computer systems exacerbate matters by requiring extremely high scalability to effectively exploit millions of processor cores. Unfortunately, significant variability in application execution performance, arising from increasingly complex interactions between hardware and system software, makes this situation much more difficult for application developers and performance analysts alike. This work considers the performance assessment of HemeLB, an exascale flagship application code of the EU HPC Centre of Excellence (CoE) for Computational Biomedicine (CompBioMed), running on the SuperMUC-NG Tier-0 leadership system, using the methodology of the Performance Optimisation and Productivity (POP) CoE. Although 80% scaling efficiency is maintained to over 100,000 MPI processes, disappointing initial performance with more processes, and the correspondingly poor strong scaling, was identified as originating from the same few compute nodes in multiple runs; later system diagnostic checks found these nodes had faulty DIMMs and lacklustre performance. Excluding these compute nodes from subsequent runs improved the performance of executions with over 300,000 MPI processes by a factor of five, resulting in a 190× speed-up compared to 864 MPI processes. While communication efficiency remains very good up to the largest scale, parallel efficiency is primarily limited by load balance, found to be largely due to core-to-core and run-to-run variability from excessive stalls for memory accesses, which affect many HPC systems with Intel Xeon Scalable processors. The POP methodology for this performance diagnosis is demonstrated via a detailed exposition using widely deployed ‘standard’ measurement and analysis tools.
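The quoted speed-up and efficiency figures follow from the usual strong-scaling definitions applied in POP assessments, where parallel efficiency is decomposed into the product of load balance and communication efficiency. The sketch below is not taken from the paper; it only illustrates how such numbers are derived. The run-time values and the 331,776-process count are hypothetical placeholders chosen for illustration; only the 864-process baseline and the reported ~190× speed-up come from the abstract.

# Hedged sketch (not from the paper): strong-scaling metrics in the style of
# the POP methodology, applied to the figures quoted in the abstract.
# Timings and the large process count below are hypothetical placeholders.

def speedup(t_ref: float, t_p: float) -> float:
    """Speed-up of a run taking t_p seconds relative to the reference run taking t_ref seconds."""
    return t_ref / t_p

def scaling_efficiency(t_ref: float, p_ref: int, t_p: float, p: int) -> float:
    """Strong-scaling efficiency: achieved speed-up divided by the ideal speed-up p/p_ref."""
    return speedup(t_ref, t_p) / (p / p_ref)

def parallel_efficiency(load_balance: float, communication_efficiency: float) -> float:
    """POP-style parallel efficiency: product of load balance and communication efficiency."""
    return load_balance * communication_efficiency

if __name__ == "__main__":
    p_ref, t_ref = 864, 1000.0      # 864-process baseline; run time is a placeholder (seconds)
    p_large = 331_776               # hypothetical stand-in for "over 300,000" MPI processes
    t_large = t_ref / 190.0         # consistent with the reported ~190x speed-up
    print(f"speed-up:           {speedup(t_ref, t_large):7.1f}x")
    print(f"scaling efficiency: {scaling_efficiency(t_ref, p_ref, t_large, p_large):7.1%}")

With these placeholder values the script reports a 190.0× speed-up and roughly 49% strong-scaling efficiency relative to the 864-process baseline, which is consistent with the abstract's observation that load balance, rather than communication, limits parallel efficiency at the largest scales.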