Fail-in-Place Network Design: Interaction Between Topology, Routing Algorithm and Failures

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
Pub Date: 2014-11-16
DOI: 10.1109/SC.2014.54

Jens Domke, T. Hoefler, S. Matsuoka

Citations: 31

Abstract

The growing system size of high-performance computers results in a steady decrease of the mean time between failures. Exchanging network components often requires whole-system downtime, which increases the cost of failures. In this work, we study a fail-in-place strategy in which broken network elements remain untouched. We show that a fail-in-place strategy is feasible for today's networks and that the degradation is manageable, and we provide guidelines for the design. Our network failure simulation tool chain allows system designers to extrapolate the performance degradation based on expected failure rates, and it can be used to evaluate the current state of a system. In a case study of real-world HPC systems, we analyze the performance degradation throughout the systems' lifetime under the assumption that faulty network components are not repaired. This analysis results in a recommendation to change the routing algorithm in use to improve the network performance as well as the fail-in-place characteristic.



