Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price
{"title":"性能便携式冰盖建模与马里","authors":"Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price","doi":"10.1177/10943420231183688","DOIUrl":null,"url":null,"abstract":"High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution; therefore, it has become essential to ensure performance on these platforms. Many of today’s supercomputers contain a diverse set of computing architectures and require specific programming interfaces in order to obtain optimal efficiency. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MPAS-Albany Land Ice (MALI) uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance portable code across a variety of different architectures. In this article, we analyze the performance portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis highlights not only the performance portable improvements made in finite element assembly and multigrid preconditioning within MALI with speedups between 1.26 and 1.82x across CPU and GPU architectures but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We perform a weak scalability study and show that simulations on GPU-based machines perform 1.24–1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.6% with GPUs. We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of 2 years of development.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"37 1","pages":"600 - 625"},"PeriodicalIF":3.5000,"publicationDate":"2022-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Performance portable ice-sheet modeling with MALI\",\"authors\":\"Jerry Watkins, Max Carlson, Kyle Shan, I. Tezaur, M. Perego, Luca Bertagna, Carolyn Kao, M. Hoffman, S. Price\",\"doi\":\"10.1177/10943420231183688\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution; therefore, it has become essential to ensure performance on these platforms. Many of today’s supercomputers contain a diverse set of computing architectures and require specific programming interfaces in order to obtain optimal efficiency. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MPAS-Albany Land Ice (MALI) uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance portable code across a variety of different architectures. In this article, we analyze the performance portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis highlights not only the performance portable improvements made in finite element assembly and multigrid preconditioning within MALI with speedups between 1.26 and 1.82x across CPU and GPU architectures but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We perform a weak scalability study and show that simulations on GPU-based machines perform 1.24–1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.6% with GPUs. We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of 2 years of development.\",\"PeriodicalId\":54957,\"journal\":{\"name\":\"International Journal of High Performance Computing Applications\",\"volume\":\"37 1\",\"pages\":\"600 - 625\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2022-04-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of High Performance Computing Applications\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1177/10943420231183688\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of High Performance Computing Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1177/10943420231183688","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 2
摘要
极地冰盖的高分辨率模拟在为概率海平面预测开发更精确和可靠的地球系统模型的持续努力中发挥着至关重要的作用。这些模拟通常需要来自大型超级计算集群的大量内存和计算来提供足够的精度和分辨率;因此,确保这些平台上的性能变得至关重要。当今的许多超级计算机都包含多种计算体系结构,并且需要特定的编程接口才能获得最佳效率。为了避免特定于体系结构的编程并保持跨平台的生产力,被称为MPAS-Albany Land Ice (MALI)的冰盖建模代码使用高级抽象来集成Trilinos库和Kokkos编程模型,以便在各种不同的体系结构中实现性能可移植代码。在本文中,我们通过对当前基于cpu和gpu的超级计算机的性能分析来分析MALI的性能可移植特性。分析不仅强调了MALI在有限元组装和多网格预处理方面的性能改进,在CPU和GPU架构上的速度提高了1.26到1.82倍,而且还确定了进一步提高GPU软件耦合和预处理性能的需求。我们进行了弱可扩展性研究,并表明在使用gpu时,基于gpu的机器上的模拟速度提高了1.24 - 1.92倍。在有限元组件上表现最好,在gpu上实现了高达8.65倍的加速和82.6%的弱缩放效率。我们还描述了一个使用变更点检测方法为这个代码库开发的自动化性能测试框架。该框架用于就马里的绩效做出可操作的决策。我们提供了几个具体的场景示例,其中框架在2年的开发过程中发现了性能回归、改进和算法差异。
High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution; therefore, it has become essential to ensure performance on these platforms. Many of today’s supercomputers contain a diverse set of computing architectures and require specific programming interfaces in order to obtain optimal efficiency. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MPAS-Albany Land Ice (MALI) uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance portable code across a variety of different architectures. In this article, we analyze the performance portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis highlights not only the performance portable improvements made in finite element assembly and multigrid preconditioning within MALI with speedups between 1.26 and 1.82x across CPU and GPU architectures but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We perform a weak scalability study and show that simulations on GPU-based machines perform 1.24–1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.6% with GPUs. We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of 2 years of development.
期刊介绍:
With ever increasing pressure for health services in all countries to meet rising demands, improve their quality and efficiency, and to be more accountable; the need for rigorous research and policy analysis has never been greater. The Journal of Health Services Research & Policy presents the latest scientific research, insightful overviews and reflections on underlying issues, and innovative, thought provoking contributions from leading academics and policy-makers. It provides ideas and hope for solving dilemmas that confront all countries.