Title: Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning
Authors: R. Schoonhoven, B. Veenboer, B. V. Werkhoven, K. Batenburg
DOI: https://doi.org/10.1109/PMBS56514.2022.00010
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and environmental costs. The energy consumption of GPU applications depends greatly on how well they are optimized. Auto-tuning is an effective and commonly applied technique for finding the optimal combination of algorithm, application, and hardware parameters to optimize the performance of a GPU application. In this paper, we introduce new energy monitoring and optimization capabilities in Kernel Tuner, a generic auto-tuning tool for GPU applications. These capabilities enable us to investigate the differences between tuning for execution time and tuning for various energy-efficiency objectives, as well as the differences in tuning difficulty. Additionally, our model for GPU power consumption greatly reduces the large tuning search space by identifying the clock frequencies at which a GPU is likely most energy efficient.
{"title":"Evaluating ISO C++ Parallel Algorithms on Heterogeneous HPC Systems","authors":"Wei-Chen Lin, Tom Deakin, Simon McIntosh-Smith","doi":"10.1109/PMBS56514.2022.00009","DOIUrl":"https://doi.org/10.1109/PMBS56514.2022.00009","url":null,"abstract":"Recent revisions to the ISO C++ standard have added specifications for parallel algorithms. These additions cover common use-cases, including sequence traversal, reduction, and even sorting, many of which are highly applicable in HPC, and thus represent a potential for increased performance and productivity.This study evaluates the state of the art for implementing heterogeneous HPC applications using the latest built-in ISO C++17 parallel algorithms. We implement C++17 ports of representative HPC mini-apps that cover both compute-bound and memory bandwidth-bound applications. We then conduct benchmarks on CPUs and GPUs, comparing our ports to other widely-available parallel programming models, such as OpenMP, CUDA, and SYCL.Finally, we show that C++17 parallel algorithms are able to achieve competitive performance across multiple mini-apps on many platforms, with some notable exceptions. We also discuss several key topics, including portability, and describe workarounds for a number of remaining issues, including index-based traversal and accelerator device/memory management.","PeriodicalId":321991,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"06 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116661513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Analysis with Unified Hardware Counter Metrics","authors":"B. Gravelle, W. Nystrom, B. Norris","doi":"10.1109/PMBS56514.2022.00011","DOIUrl":"https://doi.org/10.1109/PMBS56514.2022.00011","url":null,"abstract":"Hardware performance counters provide detailed insight into the performance of applications running on modern systems, but they can be challenging to use without detailed knowledge of the computational and counter architectures. Our work addresses this challenge by identifying metrics that are common to many micro-architectures and can be directly related to the algorithms in question. These metrics, some long used and some being presented for the first time, are carefully designed to be easy to follow, informative, and portable to multiple systems. In this paper, we discuss the background of empirical performance analysis, describe our set of metrics, and demonstrate analysis on example benchmarks and mini-applications. The metrics and examples are presented on both an Intel Xeon Cascade Lake and an ARM-based Fujitsu A64FX. The significant differences in the ISAs, caches and hardware counters between these two systems demonstrate the portability of the proposed metrics. 1","PeriodicalId":321991,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127378377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures
Authors: Taylor L. Groves, C. Daley, Rahulkumar Gayatri, H. Nam, Nan Ding, Lenny Oliker, N. Wright, Samuel Williams
DOI: https://doi.org/10.1109/PMBS56514.2022.00012
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Tighter integration of computational resources can foster superior application performance by mitigating communication bottlenecks. Unfortunately, not every application can use every compute unit or accelerator all the time, so co-locating resources often leads to under-utilization. To mitigate this challenge, architects have proposed disaggregation and ad hoc pooling of computational resources. In the next five years, HPC system architects will be presented with a spectrum of accelerated solutions, ranging from tightly coupled, single-package APUs to a sea of disaggregated GPUs interconnected by a global network. In this paper, we detail NEthing, our methodology and tool for evaluating the potential performance implications of such diverse architectural paradigms. We demonstrate our methodology on today's and projected 2026 technologies for three distinct workloads: a compute-intensive kernel, a tightly-coupled HPC simulation, and an ensemble of loosely-coupled HPC simulations. Our results leverage NEthing to quantify the increased utilization disaggregated systems must achieve in order to match the superior performance of APUs and on-board GPUs.
Title: High-Performance GMRES Multi-Precision Benchmark: Design, Performance, and Challenges
Authors: I. Yamazaki, Christian A. Glusa, J. Loe, P. Luszczek, S. Rajamanickam, J. Dongarra
DOI: https://doi.org/10.1109/PMBS56514.2022.00015
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: We propose a new benchmark for high-performance (HP) computers. Similar to the High Performance Conjugate Gradient (HPCG) benchmark, the new benchmark is designed to rank computers by how fast they can solve a sparse linear system of equations, exhibiting computational and communication requirements typical of many scientific applications. The main novelty of the new benchmark is that it is based on the Generalized Minimum Residual method (GMRES), combined with a Geometric Multi-Grid preconditioner and a Gauss-Seidel smoother, and provides the flexibility to use lower-precision arithmetic. This is motivated by new hardware architectures that deliver lower-precision arithmetic at higher performance; even on machines that do not follow this trend, lower precision reduces the required amount of data transfer, which alone can improve solver performance. Given these trends, an HP benchmark that allows the use of different precisions for solving important scientific problems will be valuable to many disciplines, and we also hope to promote the design of future HP computers that can exploit mixed-precision arithmetic to achieve high application performance. We present our initial design of the new benchmark, its reference implementation, and the performance of the reference mixed (double and single) precision Geometric Multi-Grid solvers on current top-ranked architectures. We also discuss the challenges of designing such a benchmark, along with preliminary numerical results using 16-bit values (half and bfloat16 precisions) for solving a sparse linear system of equations.
{"title":"OMPICollTune: Autotuning MPI Collectives by Incremental Online Learning","authors":"S. Hunold, Sebastian Steiner","doi":"10.1109/PMBS56514.2022.00016","DOIUrl":"https://doi.org/10.1109/PMBS56514.2022.00016","url":null,"abstract":"Collective communication operations, such as Broadcast or Reduce, are fundamental cornerstones in many high-performance applications. Most collective operations can be implemented using different algorithms, each of which has advantages and disadvantages. For that reason, MPI libraries typically implement a selection logic that attempts to make good algorithmic choices for specific problem instances. It has been shown in the literature that the hard-coded algorithm selection logic found in MPI libraries can be improved by tuning the collectives in a separate, offline micro-benchmarking run.In the present paper, we go a fundamentally different way of improving the algorithm selection for MPI collectives. We integrate the probing of different algorithms directly into the MPI library. Whenever an MPI application is started with a given process configuration, i.e., the number of nodes and the processes per node, the tuner, instead of the default selection logic, finds the next algorithm to complete an issued MPI collective call. The tuner records the runtime of this MPI call for a subset of processes. With the recorded performance data, the tuner is able to build a performance model that allows selecting an efficient algorithm for a given collective problem. Subsequently recorded performance results are then used to update the performance model, where the probabilities for selecting an algorithm are adapted by the tuner, such that slow algorithms get a smaller chance of being selected. We show in a case study, using the ECP proxy application miniAMR, that our approach can effectively tune the performance of Allreduce.","PeriodicalId":321991,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128321507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Benchmarking Fortran DO CONCURRENT on CPUs and GPUs Using BabelStream
Authors: J. Hammond, Tom Deakin, J. Cownie, Simon McIntosh-Smith
DOI: https://doi.org/10.1109/PMBS56514.2022.00013
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Fortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To do this study fairly, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered, which include Fortran DO CONCURRENT, as well as two variants of OpenACC, four variants of OpenMP (two CPU and two GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.
Title: Time-series ML-regression on Graphcore IPU-M2000 and Nvidia A100
Authors: Jan Balewski, Z. Liu, A. Tsyplikhin, Manuel Lopez Roland, Kristofer E Bouchard
DOI: https://doi.org/10.1109/PMBS56514.2022.00019
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: We compare the ML-training performance of a Graphcore IPU-M2000-based system with an Nvidia A100 GPU-based system on the Perlmutter HPC machine at NERSC/LBL. The scientific benchmark problem was multivariate regression on time-series data from a simulated biological neuron. The ML model consisted of several convolutional, batch-normalization, and fully connected layers. The training data were held in CPU memory to eliminate system-dependent I/O costs. The data-parallel training runs achieved the same sample throughput on GC200 IPUs and A100 GPUs for any number of accelerators between 1 and 256. The best achieved MSE validation loss on the IPUs was only 10% to 20% higher. The aggregated energy use per training epoch was 2.5 to 3 times lower for the Graphcore system than for the Nvidia system. This paper also discusses aspects of software-hardware co-design used to achieve the highest efficiency on the IPU with PopTorch.
Title: ML-based Performance Portability for Time-Dependent Density Functional Theory in HPC Environments
Authors: A. P. Diéguez, Min Choi, Xinran Zhu, Bryan M. Wong, K. Ibrahim
DOI: https://doi.org/10.1109/PMBS56514.2022.00006
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Time-Dependent Density Functional Theory (TDDFT) workloads are an example of high-impact computational methods that require leveraging the performance of HPC architectures. However, finding the optimal values of their performance-critical parameters raises performance portability challenges that must be addressed. In this work, we propose an ML-based tuning methodology based on Bayesian optimization and transfer learning to tackle performance portability for TDDFT codes on HPC systems. Our results demonstrate the effectiveness of our transfer-learning proposal for TDDFT workloads, which reduced the number of executed evaluations by up to 86% compared to an exhaustive search for the globally optimal performance parameters on the Cori and Perlmutter supercomputers. Compared to a Bayesian-optimization search, our proposal reduces the required evaluations by up to 46.7% to find the same optimal runtime configuration. Overall, this methodology can be applied to other scientific workloads on current and emerging high-performance architectures.
Title: A Comprehensive Evaluation of Novel AI Accelerators for Deep Learning Workloads
Authors: M. Emani, Zhen Xie, Siddhisanket Raskar, V. Sastry, William Arnold, Bruce Wilson, R. Thakur, V. Vishwanath, Zhengchun Liu, M. Papka, Cindy Orozco Bohorquez, Rickey C. Weisner, K. Li, Yongning Sheng, Yun Du, Jian Zhang, A. Tsyplikhin, Gurdaman S. Khaira, J. Fowers, R. Sivakumar, Victoria Godsoe, Adrián Macías, Chetan Tekur, Matthew Boyd
DOI: https://doi.org/10.1109/PMBS56514.2022.00007
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Scientific applications are increasingly adopting Artificial Intelligence (AI) techniques to advance science. High-performance computing centers are evaluating emerging novel hardware accelerators to efficiently run AI-driven science applications. With the wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand how these accelerators perform. The state of the art in the evaluation of deep learning workloads primarily focuses on CPUs and GPUs. In this paper, we present an overview of dataflow-based novel AI accelerators from SambaNova, Cerebras, Graphcore, and Groq. We present a first-of-its-kind evaluation of these accelerators with diverse workloads, such as Deep Learning (DL) primitives, benchmark models, and scientific machine learning applications. We also evaluate the performance of collective communication, which is key for distributed DL implementations, along with a study of scaling efficiency. We then discuss key insights, challenges, and opportunities in integrating these novel AI accelerators into supercomputing systems.