A scalable and efficient self-organizing failure detector for grid applications

Yuuki Horita, K. Taura, T. Chikayama
{"title":"A scalable and efficient self-organizing failure detector for grid applications","authors":"Yuuki Horita, K. Taura, T. Chikayama","doi":"10.1109/GRID.2005.1542743","DOIUrl":null,"url":null,"abstract":"Failure detection and group membership management are basic building blocks for self-repairing systems in distributed environments, which need to be scalable, reliable, and efficient in practice. As available resources become larger in size and more widely distributed, it is more essential that they can be easily used with a small amount of manual configuration in grid environments, where connectivities between different networks may be limited by firewalls and NATs. In this paper, we present a scalable failure detection protocol that self-organizes in grid environments. Our failure detectors autonomously create dispersed monitoring relationships among participating processes with almost no manual configuration so that each process will be monitored by a small number of other processes, and quickly disseminate notifications along the monitoring relationships when failures are detected. With simulations and real experiments, we showed that our failure detector has a practical scalability, a high reliability, and a good efficiency. The overhead with 313 processes was at most 2-percent even when the heartbeat interval was set to 0.1 second, and accordingly smaller when it was longer.","PeriodicalId":347929,"journal":{"name":"The 6th IEEE/ACM International Workshop on Grid Computing, 2005.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2005-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The 6th IEEE/ACM International Workshop on Grid Computing, 2005.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/GRID.2005.1542743","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 41

Abstract

Failure detection and group membership management are basic building blocks for self-repairing systems in distributed environments, which need to be scalable, reliable, and efficient in practice. As available resources become larger in size and more widely distributed, it is more essential that they can be easily used with a small amount of manual configuration in grid environments, where connectivities between different networks may be limited by firewalls and NATs. In this paper, we present a scalable failure detection protocol that self-organizes in grid environments. Our failure detectors autonomously create dispersed monitoring relationships among participating processes with almost no manual configuration so that each process will be monitored by a small number of other processes, and quickly disseminate notifications along the monitoring relationships when failures are detected. With simulations and real experiments, we showed that our failure detector has a practical scalability, a high reliability, and a good efficiency. The overhead with 313 processes was at most 2-percent even when the heartbeat interval was set to 0.1 second, and accordingly smaller when it was longer.
用于网格应用程序的可伸缩且高效的自组织故障检测器
故障检测和组成员管理是分布式环境中自修复系统的基本组成部分,在实践中需要具有可扩展性、可靠性和高效性。随着可用资源的规模越来越大,分布越来越广泛,在网格环境中,通过少量的手动配置就可以轻松地使用这些资源变得更加重要,在网格环境中,不同网络之间的连接可能受到防火墙和nat的限制。在本文中,我们提出了一种在网格环境中自组织的可扩展故障检测协议。我们的故障检测器在参与的流程之间自主地创建分散的监视关系,几乎不需要手动配置,这样每个流程都将由少量其他流程监视,并且在检测到故障时沿着监视关系快速传播通知。仿真和实际实验表明,该故障检测器具有实用的可扩展性、高可靠性和良好的效率。即使在心跳间隔设置为0.1秒时,313个进程的开销最多也只有2%,当心跳间隔较长时,相应的开销更小。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信