Innovative practices session 5C: Cloud atlas — Unreliability through massive connectivity

Helia Naeimi, S. Natarajan, Kushagra Vaid, P. Kudva, Mahesh Natu
{"title":"Innovative practices session 5C: Cloud atlas — Unreliability through massive connectivity","authors":"Helia Naeimi, S. Natarajan, Kushagra Vaid, P. Kudva, Mahesh Natu","doi":"10.1109/VTS.2013.6548907","DOIUrl":null,"url":null,"abstract":"The rapid pace of integration, emergence of low power, low cost computing elements, and ubiquitous and ever-increasing bandwidth of connectivity have given rise to data center and cloud infrastructures. These infrastructures are beginning to be used on a massive scale across vast geographic boundaries to provide commercial services to businesses such as banking, enterprise computing, online sales, and data mining and processing for targeted marketing to name a few. Such an infrastructure comprises of thousands of compute and storage nodes that are interconnected by massive network fabrics, each of them having their own hardware and firmware stacks, with layers of software stacks for operating systems, network protocols, schedulers and application programs. The scale of such an infrastructure has made possible service that has been unimaginable only a few years ago, but has the downside of severe losses in case of failure. A system of such scale and risk necessitates methods to (a) proactively anticipate and protect against impending failures, (b) efficiently, transparently and quickly detect, diagnose and correct failures in any software or hardware layer, and (c) be able to automatically adapt itself based on prior failures to prevent future occurrences. Addressing the above reliability challenges is inherently different from the traditional reliability techniques. First, there is a great amount of redundant resources available in the cloud from networking to computing and storage nodes, which opens up many reliability approaches by harvesting these available redundancies. Second, due to the large scale of the system, techniques with high overheads, especially in power, are not acceptable. Consequently, cross layer approaches to optimize the availability and power have gained traction recently. This session will address these challenges in maintaining reliable service with solutions across the hardware/software stacks. The currently available commercial data-center and cloud infrastructures will be reviewed and the relative occurrences of different causalities of failures, the level to which they are anticipated and diagnosed in practice, and their impact on the quality of service and infrastructure design will be discussed. A study on real-time analytics to proactively address failures in a private, secure cloud engaged in domain-specific computations, with streaming inputs received from embedded computing platforms (such as airborne image sources, data streams, or sensors) will be presented next. 
The session concludes with a discussion on the increased relevance of resiliency features built inside individual systems and components (private cloud) and how the macro public cloud absorbs innovations from this realm.","PeriodicalId":138435,"journal":{"name":"2013 IEEE 31st VLSI Test Symposium (VTS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 31st VLSI Test Symposium (VTS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/VTS.2013.6548907","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

The rapid pace of integration, the emergence of low-power, low-cost computing elements, and ubiquitous, ever-increasing connectivity bandwidth have given rise to data-center and cloud infrastructures. These infrastructures are beginning to be used on a massive scale, across vast geographic boundaries, to provide commercial services such as banking, enterprise computing, online sales, and data mining and processing for targeted marketing, to name a few. Such an infrastructure comprises thousands of compute and storage nodes interconnected by massive network fabrics, each node with its own hardware and firmware stack, topped by software layers for operating systems, network protocols, schedulers, and application programs. This scale has made possible services that were unimaginable only a few years ago, but it carries the downside of severe losses when failures occur. A system of such scale and risk requires methods that (a) proactively anticipate and protect against impending failures, (b) efficiently, transparently, and quickly detect, diagnose, and correct failures in any software or hardware layer, and (c) automatically adapt based on prior failures to prevent recurrences. Addressing these reliability challenges differs inherently from applying traditional reliability techniques. First, the cloud offers a great amount of redundant resources, from network links to compute and storage nodes, which opens up reliability approaches that harvest this available redundancy. Second, at such scale, techniques with high overheads, especially in power, are not acceptable; consequently, cross-layer approaches that jointly optimize availability and power have recently gained traction. This session will address these challenges in maintaining reliable service, with solutions spanning the hardware and software stacks. The currently available commercial data-center and cloud infrastructures will be reviewed, and the relative frequencies of different causes of failure, the degree to which they are anticipated and diagnosed in practice, and their impact on quality of service and infrastructure design will be discussed. Next, a study will be presented on real-time analytics for proactively addressing failures in a private, secure cloud engaged in domain-specific computations on streaming inputs received from embedded computing platforms (such as airborne image sources, data streams, or sensors). The session concludes with a discussion of the increased relevance of resiliency features built into individual systems and components (the private cloud) and of how the macro-scale public cloud absorbs innovations from this realm.
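To make the proactive failure-anticipation theme concrete, the sketch below shows one minimal form that real-time analytics over per-node telemetry could take: a sliding-window deviation detector that flags a node whose latest health reading departs sharply from its recent baseline. This is not from the session itself; the class name NodeHealthMonitor, the drain_node response, and the simulated telemetry values are all hypothetical, chosen only to illustrate the idea.

```python
# Illustrative sketch only: a minimal streaming anomaly detector in the
# spirit of the session's "real-time analytics" theme. All names and the
# simulated telemetry are hypothetical, not taken from the paper.
from collections import deque
from statistics import mean, stdev


class NodeHealthMonitor:
    """Tracks one health metric per node over a sliding window and flags
    nodes whose latest reading deviates sharply from recent history."""

    def __init__(self, window: int = 30, threshold: float = 3.0):
        self.window = window        # recent samples kept per node
        self.threshold = threshold  # z-score above which a node is flagged
        self.history: dict[str, deque] = {}

    def observe(self, node: str, value: float) -> bool:
        """Record a telemetry sample; return True if the node looks at risk."""
        samples = self.history.setdefault(node, deque(maxlen=self.window))
        at_risk = False
        if len(samples) >= 5:  # need a baseline before judging deviations
            mu, sigma = mean(samples), stdev(samples)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                at_risk = True
        samples.append(value)
        return at_risk


def drain_node(node: str) -> None:
    # Placeholder for the proactive response: migrate work off the node
    # before it fails outright (e.g., cordon it in the scheduler).
    print(f"draining {node}: telemetry anomaly detected")


if __name__ == "__main__":
    monitor = NodeHealthMonitor()
    # Simulated corrected-error counts: steady baseline, then a spike on node-7.
    for t in range(40):
        for node in ("node-3", "node-7"):
            reading = 2.0 + 0.1 * (t % 3)
            if node == "node-7" and t > 35:
                reading = 25.0  # the kind of sudden jump that precedes failure
            if monitor.observe(node, reading):
                drain_node(node)
```

The bounded sliding window keeps per-node state constant, which matters at data-center scale, where the abstract notes that high-overhead techniques are unacceptable; flagged nodes are drained rather than failed hard, harvesting the redundancy the cloud already provides.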