Title: AI4IO: A suite of AI-based tools for IO-aware scheduling
Authors: Michael R. Wyatt, Stephen Herbein, T. Gamblin, M. Taufer
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 370–387. Published 2022-04-03. DOI: 10.1177/10943420221079765
Abstract: Traditional workload managers do not have the capacity to consider how IO contention can increase job runtime and even cause entire resource allocations to be wasted. Whether it stems from bursts of IO demand or from parallel file system (PFS) performance degradation, IO contention must be identified and addressed to ensure maximum performance. In this paper, we present AI4IO (AI for IO), a suite of tools that use AI methods to prevent and mitigate performance losses due to IO contention. AI4IO enables existing workload managers to become IO-aware. Currently, AI4IO consists of two tools: PRIONN and CanarIO. PRIONN predicts IO contention and empowers schedulers to prevent it; CanarIO mitigates the impact of IO contention when it does occur. We measure the effectiveness of AI4IO when integrated into Flux, a next-generation scheduler, for both small- and large-scale IO-intensive job workloads. Our results show that integrating AI4IO into Flux improves workload makespan by up to 6.4%, which amounts to more than 18,000 node-hours of saved resources per week on a production cluster in our large-scale workload.

Title: An analytical performance model of generalized hierarchical scheduling
Authors: Stephen Herbein, Tapasya Patki, D. Ahn, Sebastian Mobo, Clark Hathaway, Silvina Caíno-Lores, James Corbett, D. Domyancic, T. Scogland, B. D. de Supinski, M. Taufer
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 289–306. Published 2022-03-26. DOI: 10.1177/10943420211051039
Abstract: High performance computing (HPC) workflows are undergoing tumultuous changes, including an explosion in size and complexity. Despite these changes, most batch job systems still use slow, centralized schedulers. Generalized hierarchical scheduling (GHS) solves many of the challenges that face modern workflows, but GHS has not been widely adopted in HPC. A major difficulty that hinders adoption is the lack of a performance model to aid in configuring GHS for optimal performance on a given application. We propose an analytical performance model of GHS, and we validate our proposed model with four different applications on a moderately-sized system. Our validation shows that our model is extremely accurate at predicting the performance of GHS, explaining 98.7% of the variance (i.e., an R² statistic of 0.987). Our results also support the claim that GHS overcomes scheduling throughput problems; we measured throughput improvements of up to 270× on our moderately-sized system. We then apply our performance model to a pre-exascale system, where our model predicts throughput improvements of four orders of magnitude and provides insight into optimally configuring GHS on next generation systems.

Title: Development of NCL equivalent serial and parallel Python routines for meteorological data analysis
Authors: Jatin Gharat, B. Kumar, L. Ragha, Amit Barve, Shaik Mohammad Jeelani, J. Clyne
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 337–355. Published 2022-03-22. DOI: 10.1177/10943420221077110
Abstract: The NCAR Command Language (NCL) is a popular scripting language used in the geoscience community for weather data analysis and visualization. Hundreds of years of data are analyzed daily using NCL to make accurate weather predictions. However, because NCL executes sequentially, it cannot fully utilize the parallel processing power of high-performance computing (HPC) systems. Until now, very few techniques have been developed to exploit the multi-core capability of modern HPC systems in these routines. Open-source languages are becoming increasingly popular because they support the major functionality required for data analysis and parallel computing. Hence, the developers of NCL have decided to adopt Python as the future scripting language for analysis and visualization and to enable the geosciences community to play an active role in its development and support. This study focuses on developing some of the widely used NCL routines in Python. To handle the analysis of large datasets, parallel versions of these routines are developed that work within a single node and use multi-core CPUs to achieve parallelism. Results show close agreement between the NCL and Python outputs, and the parallel versions achieve good scaling compared to their sequential counterparts.

Title: Productively accelerating positron emission tomography image reconstruction on graphics processing units with Julia
Authors: Michiel Van Gendt, Tim Besard, S. Vandenberghe, B. De Sutter
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 320–336. Published 2022-03-22. DOI: 10.1177/10943420211067520
Abstract: Research in medical imaging is hampered by a lack of programming languages that support productive, flexible programming as well as high performance. In the search for higher quality imaging, researchers ideally experiment with novel algorithms using rapid-prototyping languages such as Python. However, to speed up image reconstruction, computational resources such as those of graphics processing units (GPUs) need to be used efficiently. Doing so requires re-programming the algorithms in lower-level programming languages such as CUDA C/C++ or rephrasing them in terms of existing library implementations of established algorithms. The former has a detrimental impact on research productivity and requires system-level programming expertise, while the latter puts severe constraints on the flexibility to research novel algorithms. Here, we investigate the use of the Julia scientific programming language in the domain of PET image reconstruction as a means to obtain both high performance (portability) on GPUs and high programmer productivity and flexibility, all at once, without requiring expert GPU programming knowledge. Using rapid-prototyping features of Julia, we developed basic and performance-optimized GPU implementations of baseline maximum likelihood expectation maximization (MLEM) positron emission tomography (PET) image reconstruction algorithms, as well as multiple existing algorithmic extensions. In doing so, we mimic the effort that researchers would have to invest to evaluate the quality and performance potential of algorithms. We evaluate the obtained performance and compare it to state-of-the-art existing implementations, and we also analyse and compare the required programming effort. The Julia implementations achieve performance in line with existing GPU implementations written in low-level CUDA C, while requiring much less programming effort, even less than what is needed for much less performant CPU implementations in C++. Switching to Julia as the programming language of choice can therefore boost the productivity of research into medical imaging and deliver excellent performance at a low cost in terms of programming effort.

Title: Efficient high-precision integer multiplication on the GPU
Authors: A. P. Diéguez, M. Amor, R. Doallo, A. Nukada, S. Matsuoka
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 356–369. Published 2022-03-20. DOI: 10.1177/10943420221077964
Abstract: The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication: one is based on tiling the classical convolution algorithm while taking advantage of new CUDA architectures, a novel approach that computes the multiplication with integers and without loss of accuracy; the other is based on the Strassen algorithm, which multiplies large polynomials using the FFT, adapting the fastest FFT libraries for current GPUs and working over the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for “large enough” integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization, whereas this work describes a parallel implementation of this operation. Our results show the efficiency of our approaches for short, medium, and large sizes.

{"title":"Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance","authors":"Hiroyuki Ootomo, Rio Yokota","doi":"10.1177/10943420221090256","DOIUrl":"https://doi.org/10.1177/10943420221090256","url":null,"abstract":"Tensor Core is a mixed-precision matrix–matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on Ampere architectures. Tensor Cores were developed in response to the high demand of dense matrix multiplication from machine learning. However, many applications in scientific computing such as preconditioners for iterative solvers and low-precision Fourier transforms can exploit these Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert input matrices to half-precision, which results in loss of accuracy. To avoid this, we can keep the mantissa loss in the conversion using additional half-precision variables and use them for correcting the accuracy of matrix–matrix multiplication. Even with this correction, the use of Tensor Cores yields higher throughput compared to FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high accuracy, high performance, and low power consumption matrix–matrix multiplication implementation using Tensor Cores, which exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA’s CUTLASS. We found that the key to achieving this accuracy is how to deal with the rounding inside Tensor Cores and underflow probability during the correction computation. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for full exponent range of FP32 using TF32 Tensor Cores on NVIDIA A100 GPUs, which outperforms the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"36 1","pages":"475 - 491"},"PeriodicalIF":3.1,"publicationDate":"2022-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42971490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Parthenon—a performance portable block-structured adaptive mesh refinement framework
Authors: P. Grete, J. Dolence, J. Miller, Joshua Brown, B. Ryan, A. Gaspar, F. Glines, S. Swaminarayan, J. Lippuner, C. Solomon, G. Shipman, Christoph Junghans, Daniel Holladay, J. Stone
Journal: International Journal of High Performance Computing Applications, vol. 37, pp. 465–486. Published 2022-02-24. DOI: 10.1177/10943420221143775
Abstract: On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance-portable programming models are available, support at the application level lags behind. To address this issue, we present the performance-portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides various levels of abstraction, from multidimensional variables, to packages defining and separating components, to the launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures, including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, and Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of 1.7 × 10^13 zone-cycles/s on 9216 nodes (73,728 logical GPUs) at approximately 92% weak-scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework for targeting exascale simulations in which downstream developers can focus on their specific application rather than on the complexity of handling massively parallel, device-accelerated AMR.

Title: Large-scale ab initio simulation of light–matter interaction at the atomic scale in Fugaku
Authors: Yuta Hirokawa, A. Yamada, S. Yamada, M. Noda, M. Uemoto, T. Boku, K. Yabana
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 182–197. Published 2022-01-02. DOI: 10.1177/10943420211065723
Abstract: In the field of optical science, it is becoming increasingly important to observe and manipulate matter at the atomic scale using ultrashort pulsed light. For the first time, we have performed an ab initio simulation solving the Maxwell equations for the light electromagnetic fields, the time-dependent Kohn–Sham equation for electrons, and the Newton equation for ions in extended systems. In the simulation, the most time-consuming parts were the stencil and nonlocal pseudopotential operations on the electron orbitals, as well as fast Fourier transforms for the electron density. Code optimization was thoroughly performed on the Fujitsu A64FX processor to achieve the highest performance. A simulation of an amorphous SiO2 thin film composed of more than 10,000 atoms was performed using 27,648 nodes of the Fugaku supercomputer. The simulation achieved excellent time-to-solution, with performance close to the maximum possible value in view of the memory-bandwidth bound, as well as excellent weak scalability.

{"title":"High Performance Computing: 8th Latin American Conference, CARLA 2021, Guadalajara, Mexico, October 6–8, 2021, Revised Selected Papers","authors":"","doi":"10.1007/978-3-031-04209-6","DOIUrl":"https://doi.org/10.1007/978-3-031-04209-6","url":null,"abstract":"","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"77 1","pages":""},"PeriodicalIF":3.1,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87050523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: ExaAM: Metal additive manufacturing simulation at the fidelity of the microstructure
Authors: J. Turner, J. Belak, N. Barton, M. Bement, N. Carlson, R. Carson, S. DeWitt, J. Fattebert, N. Hodge, Z. Jibben, Wayne E. King, L. Levine, C. Newman, A. Plotkowski, B. Radhakrishnan, S. Reeve, M. Rolchigo, A. Sabau, S. Slattery, B. Stump
Journal: International Journal of High Performance Computing Applications, vol. 36, pp. 13–39. Published 2022-01-01. DOI: 10.1177/10943420211042558
Abstract: Additive manufacturing (AM), or 3D printing, of metals is transforming the fabrication of components, in part by dramatically expanding the design space, allowing optimization of shape and topology. However, although the physical processes involved in AM are similar to those of welding, a field with decades of experimental, modeling, simulation, and characterization experience, qualification of AM parts remains a challenge. The availability of exascale computational systems, particularly when combined with data-driven approaches such as machine learning, enables topology and shape optimization as well as accelerated qualification by providing process-aware, locally accurate microstructure and mechanical property models. We describe the physics components comprising the Exascale Additive Manufacturing simulation environment and report progress using highly resolved melt pool simulations to inform part-scale finite element thermomechanics simulations, drive microstructure evolution, and determine constitutive mechanical property relationships based on those microstructures using polycrystal plasticity. We report on implementation of these components for exascale computing architectures, as well as the multi-stage simulation workflow that provides a unique high-fidelity model of process–structure–property relationships for AM parts. In addition, we discuss verification and validation through collaboration with efforts such as AM-Bench, a set of benchmark test problems under development by a team led by the National Institute of Standards and Technology.
