实现有弹性和高能效的可扩展克雷洛夫求解器

IF 2.1 4区计算机科学 Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing Pub Date : 2024-11-13 DOI:10.1016/j.parco.2024.103122

Zheng Miao , Jon C. Calhoun , Rong Ge

{"title":"实现有弹性和高能效的可扩展克雷洛夫求解器","authors":"Zheng Miao , Jon C. Calhoun , Rong Ge","doi":"10.1016/j.parco.2024.103122","DOIUrl":null,"url":null,"abstract":"<div><div>Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques including checkpoint-restart and forward recovery. We focus on sparse linear solvers as they are the fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into the forward recovery that recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that while the time and energy costs of various resilience techniques are different, they share the common components and can be quantitatively evaluated with a generalized framework. This analysis framework can be used to guide the design of performance and energy optimization technologies. While each resilience technique has its advantages depending on the fault rate, system size, and power budget, the forward recovery can further benefit from matrix-aware optimizations for large-scale computing.</div></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"123 ","pages":"Article 103122"},"PeriodicalIF":2.1000,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Towards resilient and energy efficient scalable Krylov solvers\",\"authors\":\"Zheng Miao , Jon C. Calhoun , Rong Ge\",\"doi\":\"10.1016/j.parco.2024.103122\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques including checkpoint-restart and forward recovery. We focus on sparse linear solvers as they are the fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into the forward recovery that recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that while the time and energy costs of various resilience techniques are different, they share the common components and can be quantitatively evaluated with a generalized framework. This analysis framework can be used to guide the design of performance and energy optimization technologies. While each resilience technique has its advantages depending on the fault rate, system size, and power budget, the forward recovery can further benefit from matrix-aware optimizations for large-scale computing.</div></div>\",\"PeriodicalId\":54642,\"journal\":{\"name\":\"Parallel Computing\",\"volume\":\"123 \",\"pages\":\"Article 103122\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-11-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Parallel Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167819124000607\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Parallel Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167819124000607","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

摘要

超大规模计算必须同时解决能效和弹性问题，因为功率限制会影响可扩展性，故障也会更加常见。遗憾的是，能效和弹性传统上都是孤立研究的，优化其中一个通常会对另一个产生不利影响。为了在给定的功率预算内实现承诺的性能，超大规模计算要求深入了解能效、弹性和可扩展性之间的相互作用。在这项工作中，我们提出了分析和优化常见弹性技术成本的新方法，包括检查点重启和前向恢复。我们的重点是稀疏线性求解器，因为它们是许多科学应用中的基本内核。特别是，我们提出了通用的分析和实验方法，用于分析和量化计算机集群上各种恢复方案的时间和能源成本，并开发和原型了性能优化和电源管理策略，以提高能源效率。此外，我们还深入研究了最近开始引起研究人员关注的前向恢复，并提出了一种实用的矩阵感知优化技术，以缩短其恢复时间。这项工作表明，虽然各种恢复技术的时间和能源成本各不相同，但它们都有共同的组成部分，可以通过一个通用框架进行定量评估。这一分析框架可用于指导性能和能源优化技术的设计。虽然每种恢复技术都有其优势，但取决于故障率、系统规模和功率预算，前向恢复可进一步受益于大规模计算的矩阵感知优化。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Towards resilient and energy efficient scalable Krylov solvers

Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques including checkpoint-restart and forward recovery. We focus on sparse linear solvers as they are the fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into the forward recovery that recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that while the time and energy costs of various resilience techniques are different, they share the common components and can be quantitatively evaluated with a generalized framework. This analysis framework can be used to guide the design of performance and energy optimization technologies. While each resilience technique has its advantages depending on the fault rate, system size, and power budget, the forward recovery can further benefit from matrix-aware optimizations for large-scale computing.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Parallel Computing 工程技术-计算机：理论方法

CiteScore

3.50

自引率

7.10%

发文量

审稿时长

4.5 months

期刊介绍： Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogenous computing nodes to large-scale multi-node systems. Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results. Particular technical areas of interest include, but are not limited to: -System software for parallel computer systems including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing). -Enabling software including debuggers, performance tools, and system and numeric libraries. -General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems -Software engineering and productivity as it relates to parallel computing -Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism -Performance measurement results on state-of-the-art systems -Approaches to effectively utilize large-scale parallel computing including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures. -Parallel I/O systems both hardware and software -Networking technology for support of high-speed computing demonstrating the impact of high-speed computation on parallel applications