Title: Performance Trade-offs in GPU Communication: A Study of Host and Device-initiated Approaches
Authors: Taylor L. Groves, Benjamin Brock, Yuxin Chen, K. Ibrahim, Lenny Oliker, N. Wright, Samuel Williams, K. Yelick
Venue: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
DOI: 10.1109/PMBS51919.2020.00016
Abstract: Network communication on GPU-based systems is a significant roadblock for many applications with small but frequent messaging requirements. A common question from application developers is how to reduce these overheads and achieve the best communication performance on GPUs. This work examines device-initiated versus host-initiated inter-node GPU communication using NVSHMEM. We derive basic communication model parameters for single-message and batched communication before validating our model against distributed GEMM benchmarks. We use our model to estimate the performance benefits for applications transitioning from CPUs to GPUs for fixed-size and scaled workloads, and we provide general guidelines for reducing communication overheads. Our findings show that the host-initiated approach generally outperforms the device-initiated approach on the system evaluated.

Title: The Performance and Energy Efficiency Potential of FPGAs in Scientific Computing
Authors: T. Nguyen, Samuel Williams, Marco Siracusa, Colin MacLean, D. Doerfler, N. Wright
Venue: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
DOI: 10.1109/PMBS51919.2020.00007
Abstract: Hardware specialization is a promising direction for the future of digital computing, and reconfigurable technologies enable hardware specialization with modest non-recurring engineering cost. In this paper, we use FPGAs to evaluate the benefits of building specialized hardware for numerical kernels found in scientific applications. In order to evaluate performance properly, we not only compare Intel Arria 10 and Xilinx U280 performance against Intel Xeon, Intel Xeon Phi, and NVIDIA V100 GPUs, but also extend the Empirical Roofline Toolkit (ERT) to FPGAs so that our results can be assessed in terms of the Roofline Model. Although FPGA performance is known to be far below that of a GPU, we also benchmark the energy efficiency of each platform on the scientific kernels, comparing it against microbenchmark and technological limits. Results show that while FPGAs struggle to compete with GPUs in absolute terms on memory- and compute-intensive kernels, they require far less power and can deliver nearly the same energy efficiency.
{"title":"[Copyright notice]","authors":"","doi":"10.1109/pmbs51919.2020.00002","DOIUrl":"https://doi.org/10.1109/pmbs51919.2020.00002","url":null,"abstract":"","PeriodicalId":383727,"journal":{"name":"2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134594762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Evaluating the Performance of NVIDIA’s A100 Ampere GPU for Sparse and Batched Computations
Authors: H. Anzt, Y. M. Tsai, A. Abdelfattah, T. Cojean, J. Dongarra
Venue: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
DOI: 10.1109/PMBS51919.2020.00009
Abstract: GPU accelerators have become an important backbone for scientific high-performance computing, and the performance advances obtained from adopting new GPU hardware are significant. In this paper we take a first look at NVIDIA’s newest server-line GPU, the A100 architecture, part of the Ampere generation. Specifically, we assess its performance for sparse and batched computations, as these routines are relied upon in many scientific applications, and we compare it to the performance achieved on NVIDIA’s previous server-line GPU.
{"title":"Benchmarking Julia’s Communication Performance: Is Julia HPC ready or Full HPC?","authors":"S. Hunold, Sebastian Steiner","doi":"10.1109/PMBS51919.2020.00008","DOIUrl":"https://doi.org/10.1109/PMBS51919.2020.00008","url":null,"abstract":"Julia has quickly become one of the main programming languages for computational sciences, mainly due to its speed and flexibility. The speed and efficiency of Julia are the main reasons why researchers in the field of High Performance Computing have started porting their applications to Julia.Since Julia has a very small binding-overhead to C, many efficient computational kernels can be integrated into Julia without any noticeable performance drop. For that reason, highly tuned libraries, such as the Intel MKL or OpenBLAS, will allow Julia applications to achieve similar computational performance as their C counterparts. Yet, two questions remain: 1) How fast is Julia for memory-bound applications? 2) How efficient can MPI functions be called from a Julia application?In this paper, we will assess the performance of Julia with respect to HPC. To that end, we examine the raw throughput achievable with Julia using a new Julia port of the well-known STREAM benchmark. We also compare the running times of the most commonly used MPI collective operations (e.g., MPI_Allreduce) with their C counterparts. Our analysis shows that HPC performance of Julia is on-par with C in the majority of cases.","PeriodicalId":383727,"journal":{"name":"2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116956595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Lightweight Measurement and Analysis of HPC Performance Variability
Authors: Jered Dominguez-Trujillo, Keira Haskins, S. J. Khouzani, Chris Leap, Sahba Tashakkori, Quincy Wofford, Trilce Estrada, P. Bridges, Patrick M. Widener
Venue: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
DOI: 10.1109/PMBS51919.2020.00011
Abstract: Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its MPI-based implementation, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation, carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.

Title: Warwick Data Store: A Data Structure Abstraction Library
Authors: Richard O. Kirk, M. Nolten, R. Kevis, T. Law, S. Maheswaran, Steven A. Wright, S. Powell, G. Mudalige, S. Jarvis
Venue: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
DOI: 10.1109/PMBS51919.2020.00013
Abstract: With the increasing complexity of memory architectures and scientific applications, developing data structures that are performant, portable, and scalable, while also supporting developer productivity, is a challenging task. In this paper, we present Warwick Data Store (WDS), a lightweight and extensible C++ template library designed to manage these complexities and allow rapid prototyping. WDS is designed to abstract details of the underlying data structures away from the user, thus easing application development and optimisation. We show that using WDS does not significantly impact achieved performance across a variety of scientific benchmarks and proxy-applications, compilers, and architectures. The overheads are largely below 30% for smaller problems, decreasing to below 10% for larger problems. This shows that the library does not significantly impact performance, while providing additional functionality for data structures and the ability to optimise them without changing application code.
{"title":"Developing Models for the Runtime of Programs With Exponential Runtime Behavior","authors":"Michael Burger, Giang Nam Nguyen, C. Bischof","doi":"10.1109/PMBS51919.2020.00015","DOIUrl":"https://doi.org/10.1109/PMBS51919.2020.00015","url":null,"abstract":"In this paper, we present a new approach to generate runtime models for programs whose runtime grows exponentially with the value of one input parameter. Such programs are, e.g., of high interest for cryptanalysis to analyze practical security of traditional and post-quantum secure schemes. The model generation approach on the base of profiled training runs is built on ideas realized in the open source tool Extra-P, extended with a new class of model functions and a shared-memory parallel simulated annealing approach to heuristically determine coefficients for the model functions. Our approach is implemented in the open source software SimAnMo (Simulated Annealing Modeler). We demonstrate on various theoretical and synthetic, practical test cases that our approach delivers very accurate models and reliable predictions, compared to standard approaches on x86 and ARM architectures. SimAnMo is also employed to generate models of four codes which are employed to solve the so-called shortest vector problem. This is an important problem from the field of lattice-based cryptography. We demonstrate the quality of our models with measurements for higher lattice dimensions, as far as it is feasible. Additionally, we highlight inherent problems with models for algorithms with exponential runtime.","PeriodicalId":383727,"journal":{"name":"2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116067413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX
Authors: C. Alappat, Jan Laukemann, T. Gruber, G. Hager, G. Wellein, N. Meyer, T. Wettig
Venue: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
DOI: 10.1109/PMBS51919.2020.00006
Abstract: The A64FX CPU powers the current #1 supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, we construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. We also identify architectural peculiarities and derive optimization hints. Applying the ECM model to sparse matrix-vector multiplication (SpMV), we motivate why the CRS matrix storage format is inappropriate and how the SELL-C-σ format with suitable code optimizations can achieve bandwidth saturation for SpMV.
{"title":"Message from the Workshop Chairs","authors":"K. Ong, K. Smith‐Miles, Vincent C. S. Lee, W. Ng","doi":"10.1109/AIDM.2006.11","DOIUrl":"https://doi.org/10.1109/AIDM.2006.11","url":null,"abstract":"","PeriodicalId":383727,"journal":{"name":"2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130262378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}