Boosting Earth System Model Outputs and Saving PetaBytes in their Storage Using Exascale Climate Emulators

Sameh Abdulah, Allison H. Baker, George Bosilca, Qinglei Cao, Stefano Castruccio, Marc G. Genton, David E. Keyes, Zubair Khalid, Hatem Ltaief, Yan Song, Georgiy L. Stenchikov, Ying Sun

arXiv:2408.04440 (arXiv - STAT - Computation), published 2024-08-08. Citations: 0.
Abstract
We present the design and scalable implementation of an exascale climate
emulator for addressing the escalating computational and storage requirements
of high-resolution Earth System Model simulations. We utilize the spherical
harmonic transform to stochastically model spatio-temporal variations in
climate data. This provides tunable spatio-temporal resolution and
significantly improves the fidelity and granularity of climate emulation,
achieving an ultra-high spatial resolution of 0.034° (approximately 3.5 km) in
space. Our emulator, trained on 318 billion hourly temperature data points from
a 35-year simulation and 31 billion daily data points from an 83-year global
simulation ensemble, generates statistically consistent climate emulations. We
extend linear solver software to mixed-precision arithmetic on GPUs, applying
different precisions within a single solver to adapt to different correlation strengths.
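The paper's solver applies different precisions per tile inside a GPU Cholesky factorization driven by PaRSEC; as a minimal CPU-only illustration of the general idea of mixing precisions within a single solve, the classic iterative-refinement pattern below factorizes in float32 and refines residuals in float64. This is a standard-technique sketch under invented data, not the authors' implementation.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=3):
    """Solve A x = b using a cheap float32 Cholesky factorization,
    refined to float64 accuracy via iterative refinement.
    Illustrative only: the paper instead varies precision per tile
    on GPUs according to correlation strength."""
    A32 = A.astype(np.float32)
    L = np.linalg.cholesky(A32)            # low-precision factorization (A SPD)

    def solve32(r):
        # Forward/back substitution entirely in float32.
        y = np.linalg.solve(L, r.astype(np.float32))
        return np.linalg.solve(L.T, y).astype(np.float64)

    x = solve32(b)
    for _ in range(iters):
        r = b - A @ x                      # residual computed in float64
        x = x + solve32(r)                 # low-precision correction step
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)            # well-conditioned SPD test matrix
b = rng.standard_normal(200)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

For well-conditioned systems, a few refinement sweeps recover full double-precision residuals while the expensive O(n³) factorization stays in the cheaper precision, which is the payoff the abstract alludes to.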
The PaRSEC runtime system supports efficient parallel matrix operations by
optimizing the dynamic balance between computation, communication, and memory
requirements. Our BLAS3-rich code is optimized for systems equipped with four
different families and generations of GPUs, scaling well to achieve 0.976
EFlop/s on 9,025 nodes (36,100 AMD MI250X multichip module (MCM) GPUs) of
Frontier (nearly full system), 0.739 EFlop/s on 1,936 nodes (7,744 Grace-Hopper
Superchips (GH200)) of Alps, 0.243 EFlop/s on 1,024 nodes (4,096 A100 GPUs) of
Leonardo, and 0.375 EFlop/s on 3,072 nodes (18,432 V100 GPUs) of Summit.
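The spherical-harmonic emulation idea in the abstract can be caricatured in a few lines: draw Gaussian random harmonic coefficients whose variances follow a power spectrum, then synthesize a random field on a latitude-longitude grid. The decay-rate spectrum, grid sizes, and truncation degree below are invented for illustration; the actual emulator fits spatio-temporal statistics from simulation ensembles at 0.034° resolution with GPU-parallel solvers.

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def random_sh_field(lmax, nlat, nlon, rng,
                    spectrum=lambda l: 1.0 / (1.0 + l) ** 2):
    """Synthesize a random real-valued field on a lat-lon grid as a sum of
    real spherical harmonics with zero-mean Gaussian coefficients whose
    variance follows `spectrum(l)` (a toy stand-in for a fitted climate
    power spectrum)."""
    theta = np.linspace(0.0, np.pi, nlat)              # colatitude
    phi = np.linspace(0.0, 2 * np.pi, nlon, endpoint=False)
    x = np.cos(theta)
    field = np.zeros((nlat, nlon))
    for l in range(lmax + 1):
        for m in range(l + 1):
            # Orthonormalization factor for Y_l^m.
            norm = np.sqrt((2 * l + 1) / (4 * np.pi)
                           * factorial(l - m) / factorial(l + m))
            P = norm * lpmv(m, l, x)                   # assoc. Legendre in colat
            std = np.sqrt(spectrum(l))
            a = rng.normal(0.0, std)
            if m == 0:
                field += a * P[:, None]
            else:
                b = rng.normal(0.0, std)
                field += np.sqrt(2.0) * P[:, None] * (
                    a * np.cos(m * phi) + b * np.sin(m * phi))
    return field

rng = np.random.default_rng(42)
emulated = random_sh_field(lmax=16, nlat=64, nlon=128, rng=rng)
print(emulated.shape)  # (64, 128)
```

Each emulated realization is statistically consistent with the prescribed spectrum rather than a replay of stored output, which is why such emulators can stand in for petabytes of archived simulation data.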