Thanh Do, Thuy Nguyen, D. Nguyen, H. Nguyen, Weisong Shi
Trouble Dashboard: A Distributed Failure Monitoring System for High-End Computing
2009 IEEE-RIVF International Conference on Computing and Communication Technologies, published 2009-07-13
DOI: 10.1109/RIVF.2009.5174661 (https://doi.org/10.1109/RIVF.2009.5174661)
Citation count: 0
Abstract
Failure management is crucial for high-performance computing systems, especially as the complexity of applications and the underlying infrastructure has grown sharply in recent years. In this paper, we present the design, implementation, and experimental evaluation of Trouble Dashboard (TD), an adaptive, flexible, and low-overhead failure monitoring system. Our goal is to provide a lightweight, scalable failure-monitoring tool for both application scientists and system managers. TD provides a set of APIs that let application scientists flexibly control how their applications behave when failures happen. System managers can use the tool to monitor the status of computing nodes and running tasks, as well as failures when they occur. Experiments show that TD incurs low overhead while remaining accurate and flexible enough to adapt to various applications.