Mark Taylor, Peter M. Caldwell, Luca Bertagna, Conrad Clevenger, Aaron Donahue, J. Foucar, O. Guba, Benjamin Hillman, Noel Keen, Jayesh Krishna, Matthew Norman, S. Sreepathi, Christopher Terai, James B. White, A. Salinger, Renata B McCoy, L. R. Leung, David C. Bader, Danqing Wu
{"title":"The Simple Cloud-Resolving E3SM Atmosphere Model Running on the Frontier Exascale System","authors":"Mark Taylor, Peter M. Caldwell, Luca Bertagna, Conrad Clevenger, Aaron Donahue, J. Foucar, O. Guba, Benjamin Hillman, Noel Keen, Jayesh Krishna, Matthew Norman, S. Sreepathi, Christopher Terai, James B. White, A. Salinger, Renata B McCoy, L. R. Leung, David C. Bader, Danqing Wu","doi":"10.1145/3581784.3627044","DOIUrl":"https://doi.org/10.1145/3581784.3627044","url":null,"abstract":"We present an efficient and performance-portable implementation of the Simple Cloud-Resolving E3SM Atmosphere Model (SCREAM). SCREAM is a full-featured atmospheric global circulation model with a nonhydrostatic dynamical core and state-of-the-art parameterizations for microphysics, moist turbulence and radiation. It has been written from scratch in C++ with the Kokkos library used to abstract the on-node execution model for both CPUs and GPUs. SCREAM is one of only a few global atmosphere models to be ported to GPUs. As far as we know, SCREAM is the first such model to run on both AMD GPUs and NVIDIA GPUs, as well as the first to run on nearly an entire Exascale system (Frontier). On Frontier, we obtained a record-setting performance of 1.26 simulated years per day for a realistic cloud-resolving simulation.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139279564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Ltaief, Yuxi Hong, Leighton Wilson, Mathias Jacquelin, Matteo Ravasi, David E. Keyes
{"title":"Scaling the “Memory Wall” for Multi-Dimensional Seismic Processing with Algebraic Compression on Cerebras CS-2 Systems","authors":"H. Ltaief, Yuxi Hong, Leighton Wilson, Mathias Jacquelin, Matteo Ravasi, David E. Keyes","doi":"10.1145/3581784.3627042","DOIUrl":"https://doi.org/10.1145/3581784.3627042","url":null,"abstract":"We exploit the high memory bandwidth of AI-customized Cerebras CS-2 systems for seismic processing. By leveraging low-rank matrix approximation, we fit memory-hungry seismic applications onto memory-austere SRAM wafer-scale hardware, thus addressing a challenge arising in many wave-equation-based algorithms that rely on Multi-Dimensional Convolution (MDC) operators. Exploiting sparsity inherent in seismic data in the frequency domain, we implement embarrassingly parallel tile low-rank matrix-vector multiplications (TLR-MVM), which account for most of the elapsed time in MDC operations, to successfully solve the Multi-Dimensional Deconvolution (MDD) inverse problem. By reducing memory footprint along with arithmetic complexity, we fit a standard seismic benchmark dataset into the small local memories of Cerebras processing elements. 
Deploying TLR-MVM execution onto 48 CS-2 systems in support of MDD gives a sustained memory bandwidth of 92.58 PB/s on 35,784,000 processing elements, a significant milestone that highlights the capabilities of AI-customized architectures to enable a new generation of seismic algorithms that will empower multiple technologies of our low-carbon future.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"51 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139280278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sambit Das, Bikash Kanungo, Vishal Subramanian, Gourab Panigrahi, P. Motamarri, David M. Rogers, Paul M. Zimmerman, V. Gavini
{"title":"Large-Scale Materials Modeling at Quantum Accuracy: Ab Initio Simulations of Quasicrystals and Interacting Extended Defects in Metallic Alloys","authors":"Sambit Das, Bikash Kanungo, Vishal Subramanian, Gourab Panigrahi, P. Motamarri, David M. Rogers, Paul M. Zimmerman, V. Gavini","doi":"10.1145/3581784.3627037","DOIUrl":"https://doi.org/10.1145/3581784.3627037","url":null,"abstract":"Ab initio electronic-structure has remained dichotomous between achievable accuracy and length-scale. Quantum many-body (QMB) methods realize quantum accuracy but fail to scale. Density functional theory (DFT) scales favorably but remains far from quantum accuracy. We present a framework that breaks this dichotomy by use of three interconnected modules: (i) invDFT: a methodological advance in inverse DFT linking QMB methods to DFT; (ii) MLXC: a machine-learned density functional trained with invDFT data, commensurate with quantum accuracy; (iii) DFT-FE-MLXC: an adaptive higher-order spectral finite-element (FE) based DFT implementation that integrates MLXC with efficient solver strategies and HPC innovations in FE-specific dense linear algebra, mixed-precision algorithms, and asynchronous compute-communication. 
We demonstrate a paradigm shift in DFT that not only provides an accuracy commensurate with QMB methods in ground-state energies, but also attains an unprecedented performance of 659.7 PFLOPS (43.1% peak FP64 performance) on 619,124 electrons using 8,000 GPU nodes of the Frontier supercomputer.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139279821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zanhua Huang, Kai-yuan Hou, Ankit Agrawal, Alok N. Choudhary, Robert Ross, W. Liao
{"title":"I/O in WRF: A Case Study in Modern Parallel I/O Techniques","authors":"Zanhua Huang, Kai-yuan Hou, Ankit Agrawal, Alok N. Choudhary, Robert Ross, W. Liao","doi":"10.1145/3581784.3613216","DOIUrl":"https://doi.org/10.1145/3581784.3613216","url":null,"abstract":"Large-scale parallel applications can face significant I/O performance bottlenecks, making efficient I/O crucial. This work presents a comparative study of several parallel I/O implementations in the Weather Research and Forecasting model, including PnetCDF blocking and non-blocking I/O options, netCDF4, HDF5 Log VOL, and ADIOS. For I/O methods creating files in a canonical data layout, PnetCDF's non-blocking option offers up to 2x improvement over its blocking option and up to 4.5x over HDF5 via netCDF4, demonstrating the effectiveness of the write request aggregation technique. The HDF5 Log VOL outperforms ADIOS with a 4x improvement in write performance when creating files in the log layout, although both require non-negligible time to convert the file back to canonical order for post-run analysis. From these results we extract some observations that can guide I/O strategies for modern parallel codes.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"10 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139279870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shenghong Huang, Junshi Chen, Ziyu Zhang, Xiaoyu Hao, Jun Gu, Hong An, Chun Zhao, Yan Hu, Zhanming Wang, Longkui Chen, Yifan Luo, Jineng Yao, Yi Zhang, Yang Zhao, Zhihao Wang, Dongning Jia, Zhao Jin, Changming Song, Xisheng Luo, Xiaobin He, Dexun Chen
{"title":"Establishing a Modeling System in 3-km Horizontal Resolution for Global Atmospheric Circulation triggered by Submarine Volcanic Eruptions with 400 Billion Smoothed Particle Hydrodynamics","authors":"Shenghong Huang, Junshi Chen, Ziyu Zhang, Xiaoyu Hao, Jun Gu, Hong An, Chun Zhao, Yan Hu, Zhanming Wang, Longkui Chen, Yifan Luo, Jineng Yao, Yi Zhang, Yang Zhao, Zhihao Wang, Dongning Jia, Zhao Jin, Changming Song, Xisheng Luo, Xiaobin He, Dexun Chen","doi":"10.1145/3581784.3627045","DOIUrl":"https://doi.org/10.1145/3581784.3627045","url":null,"abstract":"People are increasingly concerned about how tectonic processes affect climate and vice versa. We establish a cross-sphere modeling system for volcanic eruptions and atmosphere circulation on a new Sunway supercomputer with a spatial resolution from 10m locally to 3km globally, using an improved multimedium and multiphase smoothed particle hydrodynamics (SPH) combined with a fully coupled meteorology-chemistry global atmospheric modeling scheme. We achieve 400 billion particles and 80% parallel efficiency using 39,000,000 processor cores. The simulation captures the whole dynamic process of the Tonga eruption from shock waves, earthquakes, tsunamis, mushroom clouds to the following 6-7 days of transport and diffusion of ash and water vapor, and preliminarily obtains the influence effect of full coupling of volcano, earthquake, ocean and atmosphere. 
This work is of great significance for deeply understanding the interaction between tectonic processes and climate change, and establishing an early warning simulation system for similar global hazard events.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"31 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139280164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junqi Yin, Sajal Dash, Feiyi Wang, M. Shankar
{"title":"FORGE: Pre-Training Open Foundation Models for Science","authors":"Junqi Yin, Sajal Dash, Feiyi Wang, M. Shankar","doi":"10.1145/3581784.3613215","DOIUrl":"https://doi.org/10.1145/3581784.3613215","url":null,"abstract":"Large language models (LLMs) are poised to revolutionize the way we conduct scientific research. However, both model complexity and pre-training cost are impeding effective adoption for the wider science community. Identifying suitable scientific use cases, finding the optimal balance between model and data sizes, and scaling up model training are among the most pressing issues that need to be addressed. In this study, we provide practical solutions for building and using LLM-based foundation models targeting scientific research use cases. We present an end-to-end examination of the effectiveness of LLMs in scientific research, including their scaling behavior and computational requirements on Frontier, the first Exascale supercomputer. We have also developed for release to the scientific community a suite of open foundation models called FORGE with up to 26B parameters using 257B tokens from over 200M scientific articles, with performance either on par or superior to other state-of-the-art comparable models. We have demonstrated the use and effectiveness of FORGE on scientific downstream tasks. 
Our research establishes best practices that can be applied across various fields to take advantage of LLMs for scientific discovery.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"65 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139280262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Miyoshi, A. Amemiya, S. Otsuka, Y. Maejima, James Taylor, T. Honda, Hirofumi Tomita, S. Nishizawa, Kenta Sueki, T. Yamaura, Yutaka Ishikawa, Shinsuke Satoh, T. Ushio, K. Koike, Atsuya Uno
{"title":"Big Data Assimilation: Real-time 30-second-refresh Heavy Rain Forecast Using Fugaku During Tokyo Olympics and Paralympics","authors":"T. Miyoshi, A. Amemiya, S. Otsuka, Y. Maejima, James Taylor, T. Honda, Hirofumi Tomita, S. Nishizawa, Kenta Sueki, T. Yamaura, Yutaka Ishikawa, Shinsuke Satoh, T. Ushio, K. Koike, Atsuya Uno","doi":"10.1145/3581784.3627047","DOIUrl":"https://doi.org/10.1145/3581784.3627047","url":null,"abstract":"Real-time 30-second-refresh numerical weather prediction (NWP) was performed with exclusive use of 11,580 nodes (~7%) of the supercomputer Fugaku during the Tokyo Olympics and Paralympics in 2021. A total of 75,248 forecasts were disseminated over the 1-month period, mostly stably, with a time-to-solution of less than 3 minutes for each 30-minute forecast. Japan's Big Data Assimilation (BDA) project developed the novel NWP system for precise prediction of hazardous rains toward solving the global climate crisis. Compared with typical 1-hour-refresh systems, the BDA system offered a two-orders-of-magnitude increase in problem size and revealed the effectiveness of 30-second refresh for highly nonlinear, rapidly evolving convective rains. To achieve the required time-to-solution for real-time 30-second refresh with high accuracy, the core BDA software incorporated single precision and enhanced parallel I/O with properly selected configurations of 1000 ensemble members and a 500-m-mesh weather model. 
The massively parallel, I/O intensive real-time BDA computation demonstrated a promising future direction.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"14 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139279855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Niclas Jansson, Martin Karp, Adalberto Perez, T. Mukha, Yi Ju, Jiahui Liu, Szilárd Páll, Erwin Laure, T. Weinkauf, J. Schumacher, P. Schlatter, S. Markidis
{"title":"Exploring the Ultimate Regime of Turbulent Rayleigh–Bénard Convection Through Unprecedented Spectral-Element Simulations","authors":"Niclas Jansson, Martin Karp, Adalberto Perez, T. Mukha, Yi Ju, Jiahui Liu, Szilárd Páll, Erwin Laure, T. Weinkauf, J. Schumacher, P. Schlatter, S. Markidis","doi":"10.1145/3581784.3627039","DOIUrl":"https://doi.org/10.1145/3581784.3627039","url":null,"abstract":"We detail our developments in the high-fidelity spectral-element code Neko that are essential for unprecedented large-scale direct numerical simulations of fully developed turbulence. Major innovations are a modular multi-backend design enabling performance portability across a wide range of GPUs and CPUs, a GPU-optimized preconditioner with task overlapping for the pressure-Poisson equation, and in-situ data compression. We carry out initial runs of Rayleigh-Bénard Convection (RBC) at extreme scale on the LUMI and Leonardo supercomputers. We show how Neko is able to strongly scale to 16,384 GPUs and obtain results that are not possible without careful consideration and optimization of the entire simulation workflow. These developments in Neko will help resolve the long-standing question regarding the ultimate regime in RBC.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139280152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Elia Merzari, Steven Hamilton, Thomas Evans, M. Min, Paul F. Fischer, S. Kerkemeier, Jun Fang, Paul Romano, Yu-Hsiang Lan, Malachi Phillips, E. Biondo, K. Royston, Tim Warburton, Noel Chalmers, T. Rathnayake
{"title":"Exascale Multiphysics Nuclear Reactor Simulations for Advanced Designs","authors":"Elia Merzari, Steven Hamilton, Thomas Evans, M. Min, Paul F. Fischer, S. Kerkemeier, Jun Fang, Paul Romano, Yu-Hsiang Lan, Malachi Phillips, E. Biondo, K. Royston, Tim Warburton, Noel Chalmers, T. Rathnayake","doi":"10.1145/3581784.3627038","DOIUrl":"https://doi.org/10.1145/3581784.3627038","url":null,"abstract":"ENRICO is a coupled application developed under the U.S. Department of Energy's Exascale Computing Project (ECP) targeting the modeling of advanced nuclear reactors. It couples radiation transport with heat and fluid simulation, including the high-fidelity, high-resolution Monte Carlo code Shift and the computational fluid dynamics code NekRS. NekRS is a highly-performant open-source code for simulation of incompressible and low-Mach fluid flow, heat transfer, and combustion with a particular focus on turbulent flows in complex domains. It is based on rapidly convergent high-order spectral element discretizations that feature minimal numerical dissipation and dispersion. State-of-the-art multilevel preconditioners, efficient high-order time-splitting methods, and runtime-adaptive communication strategies are built on a fast OCCA-based kernel library, libParanumal, to provide scalability and portability across the spectrum of current and future high-performance computing platforms. On Frontier, Nek5000/RS has recently achieved an unprecedented milestone in exceeding 1 billion spectral elements and 350 billion degrees of freedom. Shift has demonstrated the capability to transport upwards of 1 billion particles per second in full core nuclear reactor simulations featuring complete temperature-dependent, continuous-energy physics on Frontier. 
Shift achieved a weak-scaling efficiency of 97.8% on 8192 nodes of Frontier and calculated 6 reactions in 214,896 fuel pin regions below 1% statistical error yielding first-of-a-kind resolution for a Monte Carlo transport application.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"90 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139279887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zane Fink, K. Parasyris, G. Georgakoudis, Harshitha Menon
{"title":"HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU","authors":"Zane Fink, K. Parasyris, G. Georgakoudis, Harshitha Menon","doi":"10.48550/arXiv.2308.16877","DOIUrl":"https://doi.org/10.48550/arXiv.2308.16877","url":null,"abstract":"The end of Dennard scaling and the slowdown of Moore's law led to a shift in technology trends towards parallel architectures, particularly in HPC systems. To continue providing performance benefits, HPC should embrace Approximate Computing (AC), which trades application quality loss for improved performance. However, existing AC techniques have not been extensively applied and evaluated in state-of-the-art hardware architectures such as GPUs, the primary execution vehicle for HPC applications today. This paper presents HPAC-Offload, a pragma-based programming model that extends OpenMP offload applications to support AC techniques, allowing portable approximations across different GPU architectures. We conduct a comprehensive performance analysis of HPAC-Offload across GPU-accelerated HPC applications, revealing that AC techniques can significantly accelerate HPC applications (1.64x LULESH on AMD, 1.57x NVIDIA) with minimal quality loss (0.1%). 
Our analysis offers deep insights into the performance of GPU-based AC that guide the future development of AC algorithms and systems for these architectures.","PeriodicalId":124077,"journal":{"name":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132482337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}