Self-Caring IT Systems: A Proof-of-Concept Implementation in Virtualized Environments

2010 IEEE Second International Conference on Cloud Computing Technology and Science. Pub Date: 2010-11-30. DOI: 10.1109/CloudCom.2010.83

S. Kadirvel, J. Fortes


Citations: 3

Abstract


Self-Caring IT Systems: A Proof-of-Concept Implementation in Virtualized Environments

In self-caring IT systems, faults are handled proactively, for example by slowing the deterioration of system health, thereby avoiding or delaying system failures. This requires health management, which entails health monitoring, diagnosis, prognosis, and the planning of recovery and remediation actions. We give a brief overview of our prior work, which proposes a general methodology for capturing system properties and incorporating health management using Petri nets. We then describe in detail an application of the proposed formal method to the design and development of middleware that manages the health of a batch-based job submission system on a virtualized platform. First, we describe how a real-world job submission IT system is converted to a Petri net model. Second, we show system validation and analysis using this model to understand the resource needs of different activities in the IT chain. Third, we describe how the executable model is used as a system manager to control the operation and health management of a virtualized test bed. Fourth, we illustrate the use of a feedback controller to manage health deterioration due to resource depletion in the job-execution stage of the modeled IT chain. Using a proof-of-concept implementation, we show that early detection and handling of health deterioration yields significant benefits in cost savings and downtime reduction. Experimental results show that our health management framework can effectively prevent job failures while imposing low overhead on the managed system. For a typical workload consisting of jobs that suffer from potential resource depletion faults, our feedback controller gains the useful life needed for critical planning and remediation actions in up to 82% of the jobs.
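To make the idea of converting a job submission system to a Petri net more concrete, here is a minimal sketch in Python. It is not the authors' model: the place names (`submitted`, `vm_free`, `running`, `done`) and the transition names (`start_job`, `finish_job`) are hypothetical, chosen only to illustrate how tokens can represent jobs and virtual machines, and how a transition fires only when all of its input places hold a token.

```python
from dataclasses import dataclass, field


@dataclass
class PetriNet:
    """A tiny place/transition net; marking maps place name -> token count."""
    marking: dict[str, int]
    # transition name -> (input places, output places).
    # Assumption: each arc carries one token and input places are distinct.
    transitions: dict[str, tuple[list[str], list[str]]] = field(default_factory=dict)

    def enabled(self, name: str) -> bool:
        """A transition is enabled when every input place has a token."""
        inputs, _ = self.transitions[name]
        return all(self.marking.get(place, 0) >= 1 for place in inputs)

    def fire(self, name: str) -> None:
        """Consume one token per input place, produce one per output place."""
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] = self.marking.get(place, 0) + 1


def build_job_chain(jobs: int, vms: int) -> PetriNet:
    """Hypothetical job chain; the paper's actual stages are not given here."""
    return PetriNet(
        marking={"submitted": jobs, "vm_free": vms, "running": 0, "done": 0},
        transitions={
            # A job starts only if a job is waiting AND a VM is free.
            "start_job": (["submitted", "vm_free"], ["running"]),
            # Finishing a job releases its VM back to the pool.
            "finish_job": (["running"], ["done", "vm_free"]),
        },
    )


if __name__ == "__main__":
    net = build_job_chain(jobs=3, vms=1)
    net.fire("start_job")
    print(net.enabled("start_job"))  # the single VM is busy
    net.fire("finish_job")
    print(net.marking)
```

In a sketch like this, the `vm_free` place acts as a shared resource: analysing which transitions are blocked under a given marking is one simple way such a model can expose the resource needs of each stage. A health manager of the kind the abstract describes would presumably add further places and transitions for monitoring and remediation, which are not modeled here.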
