Towards Data-Driven Autonomics in Data Centers

2015 International Conference on Cloud and Autonomic Computing. Pub Date: 2015-05-19. DOI: 10.1109/ICCAC.2015.19

A. Sîrbu, Özalp Babaoglu


Citations: 18

Abstract

Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools; at that point, human intervention will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster, with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing machine state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether machines will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88%, with precision varying between 50% and 72%. We discuss the practicality of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for BigQuery and classification analyses are publicly available from the authors' website.
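
The evaluation quoted above fixes the false positive rate at 5% and reports the true positive rate and precision reached under that limit. The sketch below illustrates how such figures can be computed from per-machine failure scores; it is not the authors' code. The function names, the toy data, and the unweighted averaging of ensemble member scores are all assumptions made for illustration, and the abstract does not say how the ensemble actually combines its Random Forest members.

```python
from typing import List, Optional, Tuple


def average_scores(member_scores: List[List[float]]) -> List[float]:
    """Combine per-machine scores from several classifiers.

    Assumption: an unweighted mean across ensemble members. The abstract
    does not specify the combination rule used in the paper.
    """
    n_members = len(member_scores)
    return [sum(column) / n_members for column in zip(*member_scores)]


def metrics_at_fpr_limit(
    scores: List[float], labels: List[int], max_fpr: float = 0.05
) -> Optional[Tuple[float, float, float, float]]:
    """Return (threshold, tpr, fpr, precision) for the lowest score
    threshold whose false positive rate stays within max_fpr.

    labels use 1 for "fails in the next window" and 0 for "healthy".
    Returns None when no threshold satisfies the limit.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one failing and one healthy example")

    best = None
    # Lowering the threshold can only add flagged machines, so the false
    # positive rate never decreases and we can stop at the first violation.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        tp = sum(flagged)
        fp = len(flagged) - tp
        fpr = fp / negatives
        if fpr > max_fpr:
            break
        best = (threshold, tp / positives, fpr, tp / (tp + fp))
    return best


if __name__ == "__main__":
    # Toy data (hypothetical): 4 failing machines, 20 healthy ones.
    labels = [1, 1, 1, 1] + [0] * 20
    scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.5] + [0.1] * 18
    print(metrics_at_fpr_limit(scores, labels))
```

Because healthy machines typically far outnumber failing ones in this kind of data, even a 5% false positive rate can produce many false alarms in absolute terms, which is why precision is reported alongside the true positive rate.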

