基于深度学习的大规模云应用故障预测模型

2021 IEEE International Systems Conference (SysCon) Pub Date : 2021-04-15 DOI:10.1109/SysCon48628.2021.9447141

Mohammad S. Jassas, Q. Mahmoud

{"title":"基于深度学习的大规模云应用故障预测模型","authors":"Mohammad S. Jassas, Q. Mahmoud","doi":"10.1109/SysCon48628.2021.9447141","DOIUrl":null,"url":null,"abstract":"Many cloud service providers face significant challenges in preventing hardware and software failure from occurring. Due to the large scale and heterogeneous nature of cloud computing, cloud services continue to experience failures in their components. A significant proportion of previous studies have focused on the characterization of failed jobs and understanding their behavior, while a few studies have focused on failure prediction, with a focus on increasing the accuracy of failure prediction models. This paper presents the development and implementation of a failure prediction model using a deep learning approach. The proposed model can identify and detect failed tasks early on before they occur. The key feature of the failure prediction model is to improve the performance of cloud applications by reducing the number of failed jobs. In order to investigate the behavior of failure and apply the prediction of failure to the large-scale environment, we used three different traces, namely Google Cluster Trace, Mustang and Trinity. Moreover, we have evaluated the proposed model performance using different evaluation metrics to ensure that the proposed model provides the highest accuracy of predicted values. The proposed model is designed and implemented to achieve high accuracy for failure prediction, regardless of whether the model uses a large or small trace size. The evaluation results show that our proposed model achieved a high precision, recall and f1 score.","PeriodicalId":384949,"journal":{"name":"2021 IEEE International Systems Conference (SysCon)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"A Failure Prediction Model for Large Scale Cloud Applications using Deep Learning\",\"authors\":\"Mohammad S. Jassas, Q. Mahmoud\",\"doi\":\"10.1109/SysCon48628.2021.9447141\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many cloud service providers face significant challenges in preventing hardware and software failure from occurring. Due to the large scale and heterogeneous nature of cloud computing, cloud services continue to experience failures in their components. A significant proportion of previous studies have focused on the characterization of failed jobs and understanding their behavior, while a few studies have focused on failure prediction, with a focus on increasing the accuracy of failure prediction models. This paper presents the development and implementation of a failure prediction model using a deep learning approach. The proposed model can identify and detect failed tasks early on before they occur. The key feature of the failure prediction model is to improve the performance of cloud applications by reducing the number of failed jobs. In order to investigate the behavior of failure and apply the prediction of failure to the large-scale environment, we used three different traces, namely Google Cluster Trace, Mustang and Trinity. Moreover, we have evaluated the proposed model performance using different evaluation metrics to ensure that the proposed model provides the highest accuracy of predicted values. The proposed model is designed and implemented to achieve high accuracy for failure prediction, regardless of whether the model uses a large or small trace size. The evaluation results show that our proposed model achieved a high precision, recall and f1 score.\",\"PeriodicalId\":384949,\"journal\":{\"name\":\"2021 IEEE International Systems Conference (SysCon)\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Systems Conference (SysCon)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SysCon48628.2021.9447141\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Systems Conference (SysCon)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SysCon48628.2021.9447141","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

许多云服务提供商在防止硬件和软件故障发生方面面临重大挑战。由于云计算的大规模和异构性质，云服务在其组件中不断遇到故障。以往的研究主要集中在失败作业的特征和对其行为的理解上，而少数研究则集中在失效预测上，重点是提高失效预测模型的准确性。本文介绍了使用深度学习方法的故障预测模型的开发和实现。所提出的模型可以在失败任务发生之前及早识别和检测它们。故障预测模型的关键特征是通过减少失败作业的数量来提高云应用程序的性能。为了研究故障行为并将故障预测应用于大规模环境，我们使用了三种不同的跟踪，即谷歌Cluster Trace、Mustang和Trinity。此外，我们使用不同的评估指标评估了所提出的模型的性能，以确保所提出的模型提供预测值的最高精度。该模型的设计和实现，无论模型使用的轨迹尺寸是大还是小，都能达到较高的故障预测精度。评价结果表明，该模型具有较高的准确率、召回率和f1分数。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Failure Prediction Model for Large Scale Cloud Applications using Deep Learning

Many cloud service providers face significant challenges in preventing hardware and software failure from occurring. Due to the large scale and heterogeneous nature of cloud computing, cloud services continue to experience failures in their components. A significant proportion of previous studies have focused on the characterization of failed jobs and understanding their behavior, while a few studies have focused on failure prediction, with a focus on increasing the accuracy of failure prediction models. This paper presents the development and implementation of a failure prediction model using a deep learning approach. The proposed model can identify and detect failed tasks early on before they occur. The key feature of the failure prediction model is to improve the performance of cloud applications by reducing the number of failed jobs. In order to investigate the behavior of failure and apply the prediction of failure to the large-scale environment, we used three different traces, namely Google Cluster Trace, Mustang and Trinity. Moreover, we have evaluated the proposed model performance using different evaluation metrics to ensure that the proposed model provides the highest accuracy of predicted values. The proposed model is designed and implemented to achieve high accuracy for failure prediction, regardless of whether the model uses a large or small trace size. The evaluation results show that our proposed model achieved a high precision, recall and f1 score.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 IEEE International Systems Conference (SysCon)

自引率

0.00%

发文量