Sameh Abdulah, Allison H. Baker, George Bosilca, Qinglei Cao, Stefano Castruccio, Marc G. Genton, David E. Keyes, Zubair Khalid, Hatem Ltaief, Yan Song, Georgiy L. Stenchikov, Ying Sun
arXiv - STAT - Computation, 2024-08-08, DOI: arxiv-2408.04440
Boosting Earth System Model Outputs And Saving PetaBytes in their Storage Using Exascale Climate Emulators
We present the design and scalable implementation of an exascale climate
emulator for addressing the escalating computational and storage requirements
of high-resolution Earth System Model simulations. We utilize the spherical
harmonic transform to stochastically model spatio-temporal variations in
climate data. This provides tunable spatio-temporal resolution and
significantly improves the fidelity and granularity of climate emulation,
achieving an ultra-high spatial resolution of 0.034° (approximately 3.5 km).
Our emulator, trained on 318 billion hourly temperature data points from
a 35-year and 31 billion daily data points from an 83-year global simulation
ensemble, generates statistically consistent climate emulations. We extend
linear solver software to mixed-precision arithmetic on GPUs, applying different
precisions within a single solver to adapt to different correlation strengths.
The PaRSEC runtime system supports efficient parallel matrix operations by
optimizing the dynamic balance between computation, communication, and memory
requirements. Our BLAS3-rich code is optimized for systems equipped with four
different families and generations of GPUs, scaling well to achieve 0.976
EFlop/s on 9,025 nodes (36,100 AMD MI250X multichip module (MCM) GPUs) of
Frontier (nearly full system), 0.739 EFlop/s on 1,936 nodes (7,744 Grace-Hopper
Superchips (GH200)) of Alps, 0.243 EFlop/s on 1,024 nodes (4,096 A100 GPUs) of
Leonardo, and 0.375 EFlop/s on 3,072 nodes (18,432 V100 GPUs) of Summit.
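
The scaling figures above are easier to compare once normalized per GPU. The following is a minimal sketch of that arithmetic, using only the EFlop/s, node, and GPU counts quoted in the abstract; the names `RUNS`, `per_gpu_tflops`, and `summarize` are illustrative and do not come from the paper. The resulting rates are for the mixed-precision solver, so they should not be read as double-precision throughput.

```python
# Per-GPU throughput implied by the figures quoted in the abstract.
# Each entry: (system, aggregate EFlop/s, nodes, GPUs), copied from the text.
RUNS = [
    ("Frontier", 0.976, 9_025, 36_100),
    ("Alps", 0.739, 1_936, 7_744),
    ("Leonardo", 0.243, 1_024, 4_096),
    ("Summit", 0.375, 3_072, 18_432),
]


def per_gpu_tflops(eflops, gpus):
    """Convert an aggregate EFlop/s figure into TFlop/s per GPU."""
    # 1 EFlop/s = 1e6 TFlop/s
    return eflops * 1e6 / gpus


def summarize(runs):
    """Return GPUs per node and TFlop/s per GPU for each reported run."""
    rows = []
    for name, eflops, nodes, gpus in runs:
        rows.append(
            {
                "system": name,
                "gpus_per_node": gpus / nodes,
                "tflops_per_gpu": per_gpu_tflops(eflops, gpus),
            }
        )
    return rows


if __name__ == "__main__":
    for row in summarize(RUNS):
        print(
            f"{row['system']:<9} "
            f"{row['gpus_per_node']:.0f} GPUs/node  "
            f"{row['tflops_per_gpu']:.1f} TFlop/s per GPU"
        )
```

Under these numbers, the Frontier, Alps, and Leonardo runs use four GPUs per node and the Summit run uses six, which matches the node and GPU counts given in the abstract.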