Performance troubleshooting in data centers: an annotated bibliography?

ACM SIGOPS Oper. Syst. Rev., pp. 50-62. Pub Date: 2013-11-26. DOI: 10.1145/2553070.2553079

Chengwei Wang, Soila Kavulya, Jiaqi Tan, Liting Hu, Mahendra Kutare, Michael P. Kasick, K. Schwan, P. Narasimhan, R. Gandhi


Citations: 35

Abstract

In the emerging cloud computing era, enterprise data centers host a plethora of web services and applications, including those for e-commerce, distributed multimedia, and social networks, which jointly serve many aspects of our daily lives and business. For such applications, a lack of availability, reliability, or responsiveness can lead to extensive losses. For instance, on June 29, 2010, Amazon.com experienced three hours of intermittent performance problems: the normally reliable website took minutes to load items, searches came back without product links, and customers were unable to place orders. Based on its 2010 quarterly revenues, such downtime could cost Amazon up to $1.75 million per hour, making rapid problem resolution critical to its business. In another serious incident, on July 7, 2010, DBS Bank in Singapore suffered a 7-hour outage that crippled its Internet banking systems and disrupted other consumer banking services, including automated teller machines, credit card payments, and NETS payments. The cascading failure was caused by a procedural error made while replacing a faulty component in one of the bank's storage systems that was connected to its main computers. The high cost of downtime in large-scale distributed systems drives the need for troubleshooting tools that can quickly detect problems and point system administrators to potential solutions. The increasing size and complexity of enterprise applications, coupled with the large scale of the data centers in which they operate, make troubleshooting extremely challenging. Problems can arise from a wide variety of root causes because of the complex interactions between hardware and software systems. The large volume of monitoring data available in these systems can obscure the root cause of these problems. Lastly, the multi-tier nature of applications composed of entirely different subsystems man-

