Recovery for Virtualized Environments

2015 11th European Dependable Computing Conference (EDCC) Pub Date : 2015-09-01 DOI:10.1109/EDCC.2015.26

F. Cerveira, R. Barbosa, H. Madeira, Filipe Araújo


Citations: 12

Abstract

Cloud infrastructures provide elastic computing resources to client organizations, enabling them to build online applications while avoiding the fixed costs associated with a complete IT infrastructure. However, such organizations are unlikely to fully trust the cloud for their most critical applications. Among other threats, soft errors are expected to increase as transistor geometries shrink, and many of these errors are left for the software layers to correct and mask. This paper characterizes the behavior of a virtualized environment, using Xen with CentOS as the hypervisor, in the presence of soft errors. One of the main threats arises from soft errors that directly affect the hypervisor, as these faults have the potential to disrupt several virtual machines at once. With this in mind, we develop a fault-tolerant architecture for cloud applications, whose design is guided by experimental data collected through fault injection. This architecture recovers from bit-flip errors by using a watchdog timer to securely reboot the hypervisor. Nevertheless, errors might still propagate outside the system, for example to a client in a client-server interaction. Despite this, our results suggest that our architecture, combined with a few simple techniques such as timers on the client, can recover from a very large fraction of errors in client-server applications with small hardware and performance overhead. Conversely, the fraction of errors requiring Byzantine fault-tolerant techniques is quite small, so those expensive approaches can be restricted to highly critical applications.
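The abstract mentions timers on the client as one of the simple recovery techniques. The following is a minimal sketch of that idea, not the authors' implementation: the function name `call_with_timer`, its parameters, and the retry-on-expiry policy are all assumptions made for illustration, and the request is assumed to be safe to repeat.

```python
import queue
import threading
import time


def call_with_timer(request, timeout_s=0.2, max_attempts=3):
    """Run request() under a per-attempt timer and retry when it expires.

    `request` is a hypothetical zero-argument callable standing in for one
    client-server exchange. Retrying is an assumption of this sketch and is
    only appropriate when the request is idempotent.
    Returns (reply, attempt_number) or raises TimeoutError.
    """

    def run(req, out):
        # Each attempt writes to its own queue, so a late reply from an
        # abandoned attempt can never be mistaken for a newer one.
        out.put(req())

    for attempt in range(1, max_attempts + 1):
        reply = queue.Queue(maxsize=1)
        # Daemon thread: a hung attempt cannot keep the process alive.
        worker = threading.Thread(target=run, args=(request, reply), daemon=True)
        worker.start()
        try:
            return reply.get(timeout=timeout_s), attempt
        except queue.Empty:
            # Timer expired: assume the server hung or is rebooting, retry.
            continue
    raise TimeoutError(f"no reply after {max_attempts} attempts")


if __name__ == "__main__":
    calls = []

    def flaky_server():
        # Hypothetical server stub: hangs on the first call only.
        calls.append(1)
        if len(calls) == 1:
            time.sleep(1.0)
        return "ok"

    print(call_with_timer(flaky_server, timeout_s=0.1))
```

A timer of this kind only catches missing replies (hangs and crashes). It does not detect a reply that arrives on time but is wrong, which corresponds to the small fraction of errors the abstract assigns to Byzantine fault-tolerant techniques.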
