Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06). Pub Date: 2006-05-16. DOI: 10.1109/CCGRID.2006.125

D. Mogilevsky, G. Koenig, W. Yurcik

Citations: 13

Byzantine Anomaly Testing for Charm++: Providing Fault Tolerance and Survivability for Charm++ Empowered Clusters

Recent shifts in high-performance computing have increased the use of clusters built around cheap commodity processors. A typical cluster consists of individual nodes, each containing one or several processors, connected by a high-bandwidth, low-latency interconnect. Clusters offer many benefits for computation but also some drawbacks, including a tendency toward low Mean Time To Failure (MTTF) due to the sheer number of components involved. A number of fault-tolerance techniques have recently been proposed and developed to mitigate the inherent unreliability of clusters. These techniques, however, fail to address the detection of non-obvious faults, particularly Byzantine faults. At present, effectively detecting Byzantine faults is an open problem. We describe the operation of ByzwATCh, a module for detecting Byzantine hardware errors at run time as part of the Charm++ parallel programming framework.
