Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

IF 3.5 3区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Tommaso Benacchio, Luca Bonaventura, Mirco Altenbernd, C. Cantwell, P. Düben, M. Gillard, L. Giraud, Dominik Göddeke, E. Raffin, K. Teranishi, N. Wedi
{"title":"Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction","authors":"Tommaso Benacchio, Luca Bonaventura, Mirco Altenbernd, C. Cantwell, P. Düben, M. Gillard, L. Giraud, Dominik Göddeke, E. Raffin, K. Teranishi, N. Wedi","doi":"10.1177/1094342021990433","DOIUrl":null,"url":null,"abstract":"Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.","PeriodicalId":54957,"journal":{"name":"International Journal of High Performance Computing Applications","volume":"35 1","pages":"285 - 311"},"PeriodicalIF":3.5000,"publicationDate":"2021-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/1094342021990433","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of High Performance Computing Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1177/1094342021990433","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 10

Abstract

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
数值天气和气候预测的高性能计算弹性和容错
数值天气和气候预报精度的提高在很大程度上取决于可用计算能力的增长。随着顶级计算设备的核心数量达到数百万,硬件和软件故障的平均频率增加,迫使用户重新检查他们的算法和系统,以保护模拟不发生故障。本报告调查了硬件、应用级和算法级弹性方法,特别是与时间关键数值天气和气候预测系统相关的方法。分析了适用的现有策略的选择,包括数值方案的插值重新启动和压缩检查点,内存检查点,用户级故障缓解和基于备份的系统方法。数值示例展示了该技术在解决故障方面的性能,特别强调线性系统的迭代求解器,这是大气流体流动求解器的主要内容。讨论了这些策略的潜在影响,并与当前面向百亿亿次的数值天气预报算法和系统的发展有关。分析了弹性战略的绩效、效率和有效性之间的权衡,并概述了对未来发展的一些建议。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
International Journal of High Performance Computing Applications
International Journal of High Performance Computing Applications 工程技术-计算机:跨学科应用
CiteScore
6.10
自引率
6.50%
发文量
32
审稿时长
>12 weeks
期刊介绍: With ever increasing pressure for health services in all countries to meet rising demands, improve their quality and efficiency, and to be more accountable; the need for rigorous research and policy analysis has never been greater. The Journal of Health Services Research & Policy presents the latest scientific research, insightful overviews and reflections on underlying issues, and innovative, thought provoking contributions from leading academics and policy-makers. It provides ideas and hope for solving dilemmas that confront all countries.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信