Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism

2014 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE) Pub Date: 2014-12-08 DOI: 10.1109/ICEEE.2014.6978266

Uriel Cabello, José Rodríguez, A. Viveros, S. Mendoza, D. Decouchant

Citations: 8

Abstract

The GRID computing paradigm consists of multiple heterogeneous distributed clusters connected by heterogeneous network interfaces. One advantage of this paradigm is the ability to analyze massive amounts of data using computing resources located in different geographic places and running on different platforms. However, many problems must be solved in order to harness the power of those resources. In this work we deal with the problem of fault tolerance in heterogeneous computer systems. Our proposal aims to ease the recovery process when system failures are detected at runtime, avoiding the need to restart the application. It works through a set of services that perform transparent task migration across the computing nodes, hiding the complexity of error handling when a hybrid programming model based on Open MPI and OpenCL is employed.
