Fault tolerance in heterogeneous multi-cluster systems through a task migration mechanism

Uriel Cabello, José Rodríguez, A. Viveros, S. Mendoza, D. Decouchant
DOI: 10.1109/ICEEE.2014.6978266
Published in: 2014 11th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), pp. 1-7
Publication date: 2014-12-08
Citations: 8

Abstract

The GRID computing paradigm consists of multiple heterogeneous distributed clusters connected by heterogeneous network interfaces. One advantage of this paradigm is that massive amounts of data can be analyzed using computing resources located in different geographic places and running on different platforms. However, harnessing the power of those resources requires solving many problems. In this work we address the problem of fault tolerance in heterogeneous computer systems. Our proposal aims to ease recovery when system failures are detected at runtime, avoiding the need to restart the application. It works through a set of services that perform transparent task migration across the computing nodes, hiding the complexity of error handling when a hybrid programming model based on Open MPI and OpenCL is employed.
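The abstract does not describe how the migration services work internally, so the following is only a minimal single-process sketch of the general idea: when a node fails, its tasks move to a healthy node and continue from their saved progress instead of restarting. It is not the authors' implementation and does not use Open MPI or OpenCL. The names `Task`, `Node`, `migrate_tasks`, and `run`, the least-loaded placement rule, and the step-counter checkpoint are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    total_steps: int
    done_steps: int = 0  # saved progress; stands in for a checkpoint (assumption)


@dataclass
class Node:
    name: str
    healthy: bool = True
    tasks: list = field(default_factory=list)


def migrate_tasks(failed, nodes):
    """Move every task off a failed node onto the least-loaded healthy node.

    Returns one (task name, source node, destination node) record per move.
    """
    moves = []
    while failed.tasks:
        healthy = [n for n in nodes if n.healthy]
        if not healthy:
            raise RuntimeError("no healthy node left to receive tasks")
        target = min(healthy, key=lambda n: len(n.tasks))
        task = failed.tasks.pop(0)
        target.tasks.append(task)  # done_steps is preserved, so no restart
        moves.append((task.name, failed.name, target.name))
    return moves


def run(nodes, fail_at):
    """Advance every unfinished task one step per round.

    fail_at maps a round number to the name of the node that fails in it.
    """
    log = []
    round_no = 0
    while any(t.done_steps < t.total_steps for n in nodes for t in n.tasks):
        failing = fail_at.get(round_no)
        for node in nodes:
            if node.name == failing:
                node.healthy = False
                log.extend(migrate_tasks(node, nodes))
        for node in nodes:
            if node.healthy:
                for task in node.tasks:
                    if task.done_steps < task.total_steps:
                        task.done_steps += 1
        round_no += 1
    return log


# Hypothetical two-cluster setup: "cpu-cluster" fails at the start of round 2.
nodes = [Node("cpu-cluster"), Node("gpu-cluster")]
nodes[0].tasks.append(Task("t1", total_steps=4))
nodes[1].tasks.append(Task("t2", total_steps=3))
log = run(nodes, fail_at={2: "cpu-cluster"})
```

In this run, `t1` has completed two steps when its node fails; it is moved to `gpu-cluster` and finishes its remaining two steps there. A real system would also need failure detection and device-specific state transfer, which this sketch leaves out.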