Performance evaluation of fault tolerance for parallel applications in networked environments

Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162). Pub Date: 1997-08-11. DOI: 10.1109/ICPP.1997.622663

Pierre Sens

Citations: 3

Abstract

This paper presents the performance evaluation of a software fault manager for distributed applications. Dubbed STAR, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. STAR is application independent, highly configurable and easily portable to UNIX-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can efficiently be implemented in a standard networked environment.



