A scalable and efficient self-organizing failure detector for grid applications

The 6th IEEE/ACM International Workshop on Grid Computing, 2005. Pub Date: 2005-11-13. DOI: 10.1109/GRID.2005.1542743

Yuuki Horita, K. Taura, T. Chikayama


Citations: 41

Abstract

Failure detection and group membership management are basic building blocks for self-repairing systems in distributed environments, and in practice they need to be scalable, reliable, and efficient. As available resources grow in size and become more widely distributed, it becomes more essential that they can be used easily, with little manual configuration, in grid environments where connectivity between different networks may be limited by firewalls and NATs. In this paper, we present a scalable failure detection protocol that self-organizes in grid environments. Our failure detectors autonomously create dispersed monitoring relationships among participating processes with almost no manual configuration, so that each process is monitored by a small number of other processes, and they quickly disseminate notifications along the monitoring relationships when failures are detected. Through simulations and real experiments, we show that our failure detector has practical scalability, high reliability, and good efficiency. The overhead with 313 processes was at most 2 percent even when the heartbeat interval was set to 0.1 second, and correspondingly smaller with longer intervals.
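
The abstract does not give the protocol's details, so the following is only a minimal sketch of the general idea it describes, not the authors' algorithm. It assumes that each process is assigned k monitors chosen at random, that a process is suspected once its last heartbeat is older than a timeout, and that failure notices are flooded along the monitoring relationships. The function names, the parameter k, the timeout, and the random selection rule are all hypothetical.

```python
import random
from collections import deque


def build_monitoring(processes, k, seed=0):
    """Assign each process k monitors picked at random among the others.

    Assumption: random selection stands in for whatever self-organizing
    rule the paper actually uses.
    """
    rng = random.Random(seed)
    monitors = {}
    for p in processes:
        others = [q for q in processes if q != p]
        monitors[p] = set(rng.sample(others, min(k, len(others))))
    return monitors


def suspects(last_heartbeat, now, timeout):
    """Return the processes whose last heartbeat is older than `timeout`."""
    return {p for p, t in last_heartbeat.items() if now - t > timeout}


def disseminate(monitors, failed, detector):
    """Flood a failure notice along monitoring relationships.

    Edges are treated as undirected (an assumption), and the failed
    process never relays. Returns the set of informed processes.
    """
    neighbors = {p: set() for p in monitors}
    for target, watchers in monitors.items():
        for w in watchers:
            neighbors[target].add(w)
            neighbors[w].add(target)

    informed = {detector}
    queue = deque([detector])
    while queue:
        current = queue.popleft()
        for nxt in neighbors[current]:
            if nxt != failed and nxt not in informed:
                informed.add(nxt)
                queue.append(nxt)
    return informed


if __name__ == "__main__":
    procs = [f"p{i}" for i in range(10)]
    mon = build_monitoring(procs, k=3)
    # p0 stopped sending heartbeats at t=0.0; the others reported at t=0.95.
    beats = {p: 0.95 for p in procs}
    beats["p0"] = 0.0
    print(suspects(beats, now=1.0, timeout=0.3))
```

Keeping k small bounds the heartbeat traffic each process receives regardless of the total number of processes, which is the scalability property the abstract emphasizes. Whether a notice reaches every survivor depends on the monitoring graph staying connected after the failure, which this sketch does not guarantee.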

