Towards Autonomic Fault Recovery in System-S

Fourth International Conference on Autonomic Computing (ICAC'07), Pub Date: 2007-06-11, DOI: 10.1109/ICAC.2007.40

Gabriela Jacques-Silva, J. Challenger, Lou Degenaro, J. Giles, R. Wagle


Citations: 36

Abstract


Towards Autonomic Fault Recovery in System-S

System-S is a stream processing infrastructure that enables program fragments to be distributed and connected to form complex applications. There may be tens of thousands of interdependent, heterogeneous program fragments running across thousands of nodes. The scale and interconnection alone imply a need for automation in managing the program fragments, and that need is intensified because the applications operate on live streaming data and must therefore be highly available. System-S has been designed with components that autonomically manage the program fragments, but these system components are themselves susceptible to failures that can jeopardize the system and its applications. The work we present addresses the self-healing nature of these management components in System-S. In particular, we show how one key component of System-S, the job management orchestrator, can be abruptly terminated and then recover without interrupting any of the running program fragments by reconciling with other autonomous system components. We also describe techniques that we have developed to validate that the system is able to respond autonomically to a wide variety of error conditions, including the abrupt termination and recovery of key system components. Finally, we show the performance of job management orchestrator recovery for a variety of workloads.
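The abstract does not spell out how the orchestrator reconciles with other components, so the following is only a minimal sketch of the general idea: a restarted manager rebuilds its view from what surviving components report, without touching the running fragments. Every name here (NodeAgent, Orchestrator, submit, recover, the node and fragment ids) is hypothetical and is not taken from System-S.

```python
from dataclasses import dataclass, field


@dataclass
class NodeAgent:
    """Hypothetical per-node component that keeps running fragments alive."""
    node: str
    running: dict = field(default_factory=dict)  # fragment id -> job id

    def report(self):
        # Return a copy so the orchestrator cannot mutate agent state.
        return dict(self.running)


@dataclass
class Orchestrator:
    """Hypothetical job management orchestrator holding only in-memory state."""
    placements: dict = field(default_factory=dict)  # fragment id -> (job id, node)

    def submit(self, job_id, fragment_id, agent):
        # Start the fragment on the agent and record where it was placed.
        agent.running[fragment_id] = job_id
        self.placements[fragment_id] = (job_id, agent.node)

    @classmethod
    def recover(cls, agents):
        # Rebuild the placement table purely from what the agents report;
        # no fragment is stopped or restarted during recovery.
        fresh = cls()
        for agent in agents:
            for fragment_id, job_id in agent.report().items():
                fresh.placements[fragment_id] = (job_id, agent.node)
        return fresh


agents = [NodeAgent("node-a"), NodeAgent("node-b")]
orch = Orchestrator()
orch.submit("job-1", "frag-1", agents[0])
orch.submit("job-1", "frag-2", agents[1])
before = dict(orch.placements)

del orch  # simulate abrupt termination: the in-memory state is lost
recovered = Orchestrator.recover(agents)
```

In this toy setting the recovered placement table matches the one held before the simulated crash, and the agents' running sets are never modified during recovery. A real system would also have to handle agents that are unreachable or that report conflicting state, which this sketch ignores.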
