An Empirical Investigation of Missing Data Handling in Cloud Node Failure Prediction

软件产业与工程 | Pub Date: 2022-11-07 | DOI: 10.1145/3540250.3558946

Minghua Ma, Yudong Liu, Yuang Tong, Haozhe Li, Pu Zhao, Yong Xu, Hongyu Zhang, Shilin He, Lu Wang, Yingnong Dang, S. Rajmohan, Qingwei Lin

Citations: 10

An empirical investigation of missing data handling in cloud node failure prediction

Cloud computing systems have become increasingly popular in recent years. A typical cloud system uses millions of computing nodes as its basic infrastructure. Node failure has been identified as one of the most prevalent causes of cloud system downtime. To improve the reliability of cloud systems, many previous studies collected monitoring metrics from nodes and built models to predict node failures before they happen. However, based on our experience with large-scale real-world cloud systems at Microsoft, we find that the task of predicting node failure is severely hampered by missing data. A large amount of data is missing, and the problem is even worse for the latest online data used for prediction. As a result, the real-time performance of the node failure prediction model is limited. In this paper, we first characterize the missing data problem for node failure prediction. Then, we evaluate several existing data interpolation approaches and find that node-dimension interpolation approaches outperform time-dimension ones, and that deep-learning-based interpolation is the best for early prediction. Our findings can help academics and engineers address the missing data problem in cloud node failure prediction and other data-driven software engineering scenarios.
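
The abstract contrasts interpolation along the time dimension with interpolation along the node dimension but does not say which concrete algorithms were evaluated. The Python sketch below is only an illustration of that distinction, not the paper's method: the toy matrix `metrics`, the helper names `fill_along_time` and `fill_across_nodes`, and the choice of forward fill and cross-node mean are all assumptions made for this example.

```python
import math
from statistics import fmean

# Hypothetical toy data: rows are nodes, columns are consecutive time steps.
# math.nan marks a missing monitoring sample.
metrics = [
    [0.50, 0.52, math.nan, 0.55],      # node A
    [0.48, 0.50, 0.51, 0.53],          # node B
    [0.70, math.nan, 0.72, math.nan],  # node C: latest sample is missing
]


def fill_along_time(rows):
    """Time-dimension fill: carry each node's last observed value forward."""
    filled = []
    for row in rows:
        out, last = [], math.nan
        for value in row:
            if not math.isnan(value):
                last = value
            out.append(last)  # stays NaN if nothing was observed yet
        filled.append(out)
    return filled


def fill_across_nodes(rows):
    """Node-dimension fill: replace a gap with the mean of the values
    that other nodes reported at the same time step."""
    filled = [list(row) for row in rows]
    for t in range(len(rows[0])):
        observed = [row[t] for row in rows if not math.isnan(row[t])]
        if not observed:
            continue  # no node reported at this step; leave the gap
        step_mean = fmean(observed)
        for row in filled:
            if math.isnan(row[t]):
                row[t] = step_mean
    return filled


by_time = fill_along_time(metrics)
by_node = fill_across_nodes(metrics)
```

In this toy case the two strategies fill node C's missing latest sample from different sources: the time-dimension fill reuses the node's own earlier reading, while the node-dimension fill uses what the other nodes reported at that same step. Which behaves better on real data is an empirical question of the kind the paper investigates.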

Source journal: 软件产业与工程

Self-citation rate: 0.00%

Articles published: 676