Title: AI4IO: A suite of AI-based tools for IO-aware scheduling
Authors: Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 370–387. Published 2022-04-03. DOI: 10.1177/10943420221079765
Abstract: Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether it stems from bursts of IO demand or from parallel file system (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools that use AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it; CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves workload makespan by up to 6.4%, which amounts to more than 18,000 node-hours of saved resources per week on a production cluster in our large-scale workload.

Title: An analytical performance model of generalized hierarchical scheduling
Authors: Stephen Herbein, Tapasya Patki, D. Ahn, Sebastian Mobo, Clark Hathaway, Silvina Caíno-Lores, James Corbett, D. Domyancic, T. Scogland, B. D. de Supinski, M. Taufer
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 289–306. Published 2022-03-26. DOI: 10.1177/10943420211051039
Abstract: High performance computing (HPC) workflows are undergoing tumultuous changes, including an explosion in size and complexity. Despite these changes, most batch job systems still use slow, centralized schedulers. Generalized hierarchical scheduling (GHS) solves many of the challenges that face modern workflows, but GHS has not been widely adopted in HPC. A major difficulty that hinders adoption is the lack of a performance model to aid in configuring GHS for optimal performance on a given application. We propose an analytical performance model of GHS, and we validate our proposed model with four different applications on a moderately-sized system. Our validation shows that our model is extremely accurate at predicting the performance of GHS, explaining 98.7% of the variance (i.e., an R² statistic of 0.987). Our results also support the claim that GHS overcomes scheduling throughput problems; we measured throughput improvements of up to 270× on our moderately-sized system. We then apply our performance model to a pre-exascale system, where our model predicts throughput improvements of four orders of magnitude and provides insight into optimally configuring GHS on next generation systems.

Title: Development of NCL equivalent serial and parallel Python routines for meteorological data analysis
Authors: Jatin Gharat, B. Kumar, L. Ragha, Amit Barve, Shaik Mohammad Jeelani, J. Clyne
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 337–355. Published 2022-03-22. DOI: 10.1177/10943420221077110
Abstract: The NCAR Command Language (NCL) is a popular scripting language used in the geoscience community for weather data analysis and visualization. Hundreds of years of data are analyzed daily using NCL to make accurate weather predictions. However, because NCL executes sequentially, it cannot fully utilize the parallel processing power of high-performance computing (HPC) systems. Until now, very few techniques have been developed to exploit the multi-core capability of modern HPC systems in these routines. Open-source languages are becoming increasingly popular because they support the major functionality required for data analysis and parallel computing. Hence, the developers of NCL have decided to adopt Python as the future scripting language for analysis and visualization and to enable the geosciences community to play an active role in its development and support. This study focuses on developing some of the widely used NCL routines in Python. To handle the analysis of large datasets, parallel versions of these routines are developed that work within a single node and use multi-core CPUs to achieve parallelism. Results show close agreement between the NCL and Python outputs, and the parallel versions achieve good scaling compared to their sequential counterparts.

Title: Productively accelerating positron emission tomography image reconstruction on graphics processing units with Julia
Authors: Michiel Van Gendt, Tim Besard, S. Vandenberghe, B. De Sutter
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 320–336. Published 2022-03-22. DOI: 10.1177/10943420211067520
Abstract: Research in medical imaging is hampered by a lack of programming languages that support productive, flexible programming as well as high performance. In the search for higher quality imaging, researchers ideally experiment with novel algorithms using rapid-prototyping languages such as Python. However, to speed up image reconstruction, computational resources such as those of graphics processing units (GPUs) need to be used efficiently. Doing so requires re-programming the algorithms in lower-level programming languages such as CUDA C/C++ or rephrasing them in terms of existing library implementations of established algorithms. The former has a detrimental impact on research productivity and requires system-level programming expertise, while the latter puts severe constraints on the flexibility to research novel algorithms. Here, we investigate the use of the Julia scientific programming language in the domain of PET image reconstruction as a means to obtain both high performance (portability) on GPUs and high programmer productivity and flexibility, all at once, without requiring expert GPU programming knowledge. Using rapid-prototyping features of Julia, we developed basic and performance-optimized GPU implementations of baseline maximum likelihood expectation maximization (MLEM) positron emission tomography (PET) image reconstruction algorithms, as well as multiple existing algorithmic extensions. In doing so, we mimic the effort that researchers would have to invest to evaluate the quality and performance potential of algorithms. We evaluate the obtained performance and compare it to state-of-the-art existing implementations, and we also analyse and compare the required programming effort. The Julia implementations achieve performance in line with existing GPU implementations written in low-level CUDA C, while requiring much less programming effort, even less than what is needed for much less performant CPU implementations in C++. Switching to Julia as the programming language of choice can therefore boost the productivity of research into medical imaging and deliver excellent performance at a low cost in terms of programming effort.

Title: Efficient high-precision integer multiplication on the GPU
Authors: A. P. Diéguez, M. Amor, R. Doallo, A. Nukada, S. Matsuoka
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 356–369. Published 2022-03-20. DOI: 10.1177/10943420221077964
Abstract: The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication: one is based on tiling the classical convolution algorithm while taking advantage of new CUDA architectures, a novel approach that computes the multiplication with integers and without loss of accuracy; the other is based on the Strassen algorithm, which multiplies large polynomials using the FFT, adapting the fastest FFT libraries for current GPUs and working over the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for “large enough” integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization, whereas this work describes a parallel implementation of this operation. Our results show the efficiency of our approaches for short, medium, and large sizes.

{"title":"Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance","authors":"Hiroyuki Ootomo, Rio Yokota","doi":"10.1177/10943420221090256","DOIUrl":"https://doi.org/10.1177/10943420221090256","url":null,"abstract":"Tensor Core is a mixed-precision matrix–matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand of dense matrix multiplication from machine learning. However, many applications in scientific computing such as preconditioners for iterative solvers and low-precision Fourier transforms can exploit these Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert input matrices to half-precision, which results in loss of accuracy. To avoid this, we can keep the mantissa loss in the conversion using additional half-precision variables and use them for correcting the accuracy of matrix–matrix multiplication. Even with this correction, the use of Tensor Cores yields higher throughput compared to FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high accuracy, high performance, and low power consumption matrix–matrix multiplication implementation using Tensor Cores, which exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA’s CUTLASS. We found that the key to achieving this accuracy is how to deal with the rounding inside Tensor Cores and underflow probability during the correction computation. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for full exponent range of FP32 using TF32 Tensor Cores on NVIDIA A100 GPUs, which outperforms the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"475 - 491"},"PeriodicalIF":3.1,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42971490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Parthenon—a performance portable block-structured adaptive mesh refinement framework
Authors: P. Grete, J. Dolence, J. Miller, Joshua Brown, B. Ryan, A. Gaspar, F. Glines, S. Swaminarayan, J. Lippuner, C. Solomon, G. Shipman, Christoph Junghans, Daniel Holladay, J. Stone
Journal: International Journal of High Performance Computing Applications, vol. 37, pp. 465–486. Published 2022-02-24. DOI: 10.1177/10943420221143775
Abstract: On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance-portable programming models are available, support at the application level lags behind. To address this issue, we present the performance-portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides various levels of abstraction, from multidimensional variables, to packages defining and separating components, to the launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures, including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, and Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7 × 10^13 zone-cycles/s on 9216 nodes (73,728 logical GPUs) at approximately 92% weak-scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework for targeting exascale simulations in which downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.

Title: Large-scale ab initio simulation of light–matter interaction at the atomic scale in Fugaku
Authors: Yuta Hirokawa, A. Yamada, S. Yamada, M. Noda, M. Uemoto, T. Boku, K. Yabana
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 182–197. Published 2022-01-02. DOI: 10.1177/10943420211065723
Abstract: In the field of optical science, it is becoming increasingly important to observe and manipulate matter at the atomic scale using ultrashort pulsed light. For the first time, we have performed an ab initio simulation solving the Maxwell equations for the light electromagnetic fields, the time-dependent Kohn–Sham equation for electrons, and the Newton equation for ions in extended systems. In the simulation, the most time-consuming parts were the stencil and nonlocal pseudopotential operations on the electron orbitals, as well as fast Fourier transforms for the electron density. Code optimization was thoroughly performed on the Fujitsu A64FX processor to achieve the highest performance. A simulation of an amorphous SiO2 thin film composed of more than 10,000 atoms was performed using 27,648 nodes of the Fugaku supercomputer. The simulation achieved excellent time-to-solution, with performance close to the maximum possible value in view of the memory-bandwidth bound, as well as excellent weak scalability.

{"title":"High Performance Computing: 8th Latin American Conference, CARLA 2021, Guadalajara, Mexico, October 6–8, 2021, Revised Selected Papers","authors":"","doi":"10.1007/978-3-031-04209-6","DOIUrl":"https://doi.org/10.1007/978-3-031-04209-6","url":null,"abstract":"","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"77 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87050523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: ExaAM: Metal additive manufacturing simulation at the fidelity of the microstructure
Authors: J. Turner, J. Belak, N. Barton, M. Bement, N. Carlson, R. Carson, S. DeWitt, J. Fattebert, N. Hodge, Z. Jibben, Wayne E. King, L. Levine, C. Newman, A. Plotkowski, B. Radhakrishnan, S. Reeve, M. Rolchigo, A. Sabau, S. Slattery, B. Stump
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 13–39. Published 2022-01-01. DOI: 10.1177/10943420211042558
Abstract: Additive manufacturing (AM), or 3D printing, of metals is transforming the fabrication of components, in part by dramatically expanding the design space, allowing optimization of shape and topology. However, although the physical processes involved in AM are similar to those of welding, a field with decades of experimental, modeling, simulation, and characterization experience, qualification of AM parts remains a challenge. The availability of exascale computational systems, particularly when combined with data-driven approaches such as machine learning, enables topology and shape optimization as well as accelerated qualification by providing process-aware, locally accurate microstructure and mechanical property models. We describe the physics components comprising the Exascale Additive Manufacturing simulation environment and report progress using highly resolved melt pool simulations to inform part-scale finite element thermomechanics simulations, drive microstructure evolution, and determine constitutive mechanical property relationships based on those microstructures using polycrystal plasticity. We report on implementation of these components for exascale computing architectures, as well as the multi-stage simulation workflow that provides a unique high-fidelity model of process–structure–property relationships for AM parts. In addition, we discuss verification and validation through collaboration with efforts such as AM-Bench, a set of benchmark test problems under development by a team led by the National Institute of Standards and Technology.
