Title: Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning
Authors: R. Schoonhoven, B. Veenboer, B. V. Werkhoven, K. Batenburg
DOI: https://doi.org/10.1109/PMBS56514.2022.00010
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and environmental costs. The energy consumption of GPU applications depends greatly on how well they are optimized. Auto-tuning is an effective and commonly applied technique for finding the optimal combination of algorithm, application, and hardware parameters to optimize the performance of a GPU application. In this paper, we introduce new energy monitoring and optimization capabilities in Kernel Tuner, a generic auto-tuning tool for GPU applications. These capabilities enable us to investigate the differences between tuning for execution time and tuning for various energy-efficiency objectives, as well as the differences in tuning difficulty. Additionally, our model for GPU power consumption greatly reduces the large tuning search space by identifying the clock frequencies at which a GPU is likely most energy efficient.
{"title":"Evaluating ISO C++ Parallel Algorithms on Heterogeneous HPC Systems","authors":"Wei-Chen Lin, Tom Deakin, Simon McIntosh-Smith","doi":"10.1109/PMBS56514.2022.00009","DOIUrl":"https://doi.org/10.1109/PMBS56514.2022.00009","url":null,"abstract":"Recent revisions to the ISO C++ standard have added specifications for parallel algorithms. These additions cover common use-cases, including sequence traversal, reduction, and even sorting, many of which are highly applicable in HPC, and thus represent a potential for increased performance and productivity.This study evaluates the state of the art for implementing heterogeneous HPC applications using the latest built-in ISO C++17 parallel algorithms. We implement C++17 ports of representative HPC mini-apps that cover both compute-bound and memory bandwidth-bound applications. We then conduct benchmarks on CPUs and GPUs, comparing our ports to other widely-available parallel programming models, such as OpenMP, CUDA, and SYCL.Finally, we show that C++17 parallel algorithms are able to achieve competitive performance across multiple mini-apps on many platforms, with some notable exceptions. We also discuss several key topics, including portability, and describe workarounds for a number of remaining issues, including index-based traversal and accelerator device/memory management.","PeriodicalId":321991,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"06 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116661513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Analysis with Unified Hardware Counter Metrics","authors":"B. Gravelle, W. Nystrom, B. Norris","doi":"10.1109/PMBS56514.2022.00011","DOIUrl":"https://doi.org/10.1109/PMBS56514.2022.00011","url":null,"abstract":"Hardware performance counters provide detailed insight into the performance of applications running on modern systems, but they can be challenging to use without detailed knowledge of the computational and counter architectures. Our work addresses this challenge by identifying metrics that are common to many micro-architectures and can be directly related to the algorithms in question. These metrics, some long used and some being presented for the first time, are carefully designed to be easy to follow, informative, and portable to multiple systems. In this paper, we discuss the background of empirical performance analysis, describe our set of metrics, and demonstrate analysis on example benchmarks and mini-applications. The metrics and examples are presented on both an Intel Xeon Cascade Lake and an ARM-based Fujitsu A64FX. The significant differences in the ISAs, caches and hardware counters between these two systems demonstrate the portability of the proposed metrics. 1","PeriodicalId":321991,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127378377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A Methodology for Evaluating Tightly-integrated and Disaggregated Accelerated Architectures
Authors: Taylor L. Groves, C. Daley, Rahulkumar Gayatri, H. Nam, Nan Ding, Lenny Oliker, N. Wright, Samuel Williams
DOI: https://doi.org/10.1109/PMBS56514.2022.00012
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Tighter integration of computational resources can foster superior application performance by mitigating communication bottlenecks. Unfortunately, not every application can use every compute unit or accelerator all the time, so co-locating resources often leads to under-utilization. To mitigate this challenge, architects have proposed disaggregation and ad hoc pooling of computational resources. In the next five years, HPC system architects will be presented with a spectrum of accelerated solutions, ranging from tightly coupled, single-package APUs to a sea of disaggregated GPUs interconnected by a global network. In this paper, we detail NEthing, our methodology and tool for evaluating the potential performance implications of such diverse architectural paradigms. We demonstrate our methodology on today's and projected 2026 technologies for three distinct workloads: a compute-intensive kernel, a tightly-coupled HPC simulation, and an ensemble of loosely-coupled HPC simulations. Our results leverage NEthing to quantify the increased utilization disaggregated systems must achieve in order to match the superior performance of APUs and on-board GPUs.
Title: High-Performance GMRES Multi-Precision Benchmark: Design, Performance, and Challenges
Authors: I. Yamazaki, Christian A. Glusa, J. Loe, P. Luszczek, S. Rajamanickam, J. Dongarra
DOI: https://doi.org/10.1109/PMBS56514.2022.00015
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: We propose a new benchmark for high-performance (HP) computers. Similar to the High Performance Conjugate Gradient (HPCG) benchmark, the new benchmark is designed to rank computers by how fast they can solve a sparse linear system of equations, exhibiting computational and communication requirements typical of many scientific applications. The main novelty of the new benchmark is that it is based on the Generalized Minimum Residual method (GMRES), combined with a Geometric Multi-Grid preconditioner and a Gauss-Seidel smoother, and provides the flexibility to use lower-precision arithmetic. This is motivated by new hardware architectures that deliver lower-precision arithmetic at higher performance; even on machines that do not follow this trend, lower precision reduces the required amount of data transfer, which alone can improve solver performance. Given these trends, an HP benchmark that allows the use of different precisions for solving important scientific problems will be valuable to many disciplines, and we also hope to promote the design of future HP computers that can exploit mixed-precision arithmetic to achieve high application performance. We present our initial design of the new benchmark, its reference implementation, and the performance of the reference mixed (double and single) precision Geometric Multi-Grid solvers on current top-ranked architectures. We also discuss the challenges of designing such a benchmark, along with preliminary numerical results using 16-bit values (half and bfloat16 precisions) for solving a sparse linear system of equations.
{"title":"OMPICollTune: Autotuning MPI Collectives by Incremental Online Learning","authors":"S. Hunold, Sebastian Steiner","doi":"10.1109/PMBS56514.2022.00016","DOIUrl":"https://doi.org/10.1109/PMBS56514.2022.00016","url":null,"abstract":"Collective communication operations, such as Broadcast or Reduce, are fundamental cornerstones in many high-performance applications. Most collective operations can be implemented using different algorithms, each of which has advantages and disadvantages. For that reason, MPI libraries typically implement a selection logic that attempts to make good algorithmic choices for specific problem instances. It has been shown in the literature that the hard-coded algorithm selection logic found in MPI libraries can be improved by tuning the collectives in a separate, offline micro-benchmarking run.In the present paper, we go a fundamentally different way of improving the algorithm selection for MPI collectives. We integrate the probing of different algorithms directly into the MPI library. Whenever an MPI application is started with a given process configuration, i.e., the number of nodes and the processes per node, the tuner, instead of the default selection logic, finds the next algorithm to complete an issued MPI collective call. The tuner records the runtime of this MPI call for a subset of processes. With the recorded performance data, the tuner is able to build a performance model that allows selecting an efficient algorithm for a given collective problem. Subsequently recorded performance results are then used to update the performance model, where the probabilities for selecting an algorithm are adapted by the tuner, such that slow algorithms get a smaller chance of being selected. We show in a case study, using the ECP proxy application miniAMR, that our approach can effectively tune the performance of Allreduce.","PeriodicalId":321991,"journal":{"name":"2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128321507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Benchmarking Fortran DO CONCURRENT on CPUs and GPUs Using BabelStream
Authors: J. Hammond, Tom Deakin, J. Cownie, Simon McIntosh-Smith
DOI: https://doi.org/10.1109/PMBS56514.2022.00013
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Fortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To do this study fairly, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered, which include Fortran DO CONCURRENT, as well as two variants of OpenACC, four variants of OpenMP (two CPU and two GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.
Title: Time-series ML-regression on Graphcore IPU-M2000 and Nvidia A100
Authors: Jan Balewski, Z. Liu, A. Tsyplikhin, Manuel Lopez Roland, Kristofer E Bouchard
DOI: https://doi.org/10.1109/PMBS56514.2022.00019
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: We compare the ML-training performance of a Graphcore IPU-M2000-based system with an Nvidia A100 GPU-based system on the Perlmutter HPC machine at NERSC/LBL. The scientific benchmark problem was multivariate regression on time-series data from a simulated biological neuron. The ML model consisted of several convolutional, batch-normalization, and fully connected layers. The training data were held in CPU memory to eliminate system-dependent I/O costs. The data-parallel training runs achieved the same sample throughput on GC200 IPUs and A100 GPUs for any number of accelerators between 1 and 256. The best achieved MSE validation loss on the IPUs was only 10% to 20% higher. The aggregated energy use per training epoch was 2.5 to 3 times lower for the Graphcore system than for the Nvidia system. This paper also discusses aspects of software-hardware co-design used to achieve the highest efficiency on the IPU with PopTorch.
Title: ML-based Performance Portability for Time-Dependent Density Functional Theory in HPC Environments
Authors: A. P. Diéguez, Min Choi, Xinran Zhu, Bryan M. Wong, K. Ibrahim
DOI: https://doi.org/10.1109/PMBS56514.2022.00006
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Time-Dependent Density Functional Theory (TDDFT) workloads are an example of high-impact computational methods that require leveraging the performance of HPC architectures. However, finding the optimal values of their performance-critical parameters raises performance portability challenges that must be addressed. In this work, we propose an ML-based tuning methodology based on Bayesian optimization and transfer learning to tackle performance portability for TDDFT codes on HPC systems. Our results demonstrate the effectiveness of our transfer-learning proposal for TDDFT workloads, which reduced the number of executed evaluations by up to 86% compared to an exhaustive search for the globally optimal performance parameters on the Cori and Perlmutter supercomputers. Compared to a Bayesian-optimization search, our proposal reduces the required evaluations by up to 46.7% to find the same optimal runtime configuration. Overall, this methodology can be applied to other scientific workloads on current and emerging high-performance architectures.
Title: A Comprehensive Evaluation of Novel AI Accelerators for Deep Learning Workloads
Authors: M. Emani, Zhen Xie, Siddhisanket Raskar, V. Sastry, William Arnold, Bruce Wilson, R. Thakur, V. Vishwanath, Zhengchun Liu, M. Papka, Cindy Orozco Bohorquez, Rickey C. Weisner, K. Li, Yongning Sheng, Yun Du, Jian Zhang, A. Tsyplikhin, Gurdaman S. Khaira, J. Fowers, R. Sivakumar, Victoria Godsoe, Adrián Macías, Chetan Tekur, Matthew Boyd
DOI: https://doi.org/10.1109/PMBS56514.2022.00007
Published in: 2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), November 2022
Abstract: Scientific applications are increasingly adopting Artificial Intelligence (AI) techniques to advance science. High-performance computing centers are evaluating emerging novel hardware accelerators to efficiently run AI-driven science applications. With the wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand how these accelerators perform. The state of the art in the evaluation of deep learning workloads primarily focuses on CPUs and GPUs. In this paper, we present an overview of dataflow-based novel AI accelerators from SambaNova, Cerebras, Graphcore, and Groq. We present a first-of-its-kind evaluation of these accelerators with diverse workloads, such as Deep Learning (DL) primitives, benchmark models, and scientific machine learning applications. We also evaluate the performance of collective communication, which is key for distributed DL implementations, along with a study of scaling efficiency. We then discuss key insights, challenges, and opportunities in integrating these novel AI accelerators into supercomputing systems.