Machine learning job failure analysis and prediction model for the cloud environment

Impact Factor: 3.0 · JCR Q2 (Computer Science, Information Systems)

High-Confidence Computing · Pub Date: 2023-09-27 · DOI: 10.1016/j.hcc.2023.100165

Harikrishna Bommala, Uma Maheswari V., Rajanikanth Aluvalu, Swapna Mudrakola

High-Confidence Computing, Volume 3, Issue 4, Article 100165. Full text: https://www.sciencedirect.com/science/article/pii/S2667295223000636

Citations: 0

Abstract

Reliable and accessible cloud applications are essential for the future of ubiquitous computing, smart appliances, and electronic health. Owing to the vastness and diversity of the cloud, many cloud services, both physical and logical, experience failures. Using currently accessible traces, we assessed and characterized the behavior of successful and unsuccessful jobs. We devised and implemented a method to forecast which jobs will fail. The proposed method helps cloud applications use resources more efficiently. An in-depth evaluation of the proposed model was conducted using the publicly available Google Cluster, Mustang, and Trinity traces. The traces were also fed into several different machine learning models to select the most reliable one. Our analysis shows that the model performs well in terms of accuracy, F1-score, and recall. Several measures, such as forecasting job failures, designing scheduling algorithms, modifying priority criteria, and restricting task resubmission, may increase cloud service dependability and availability.
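The abstract names accuracy, recall, and F1-score as the criteria for choosing among several models but does not say which models or features were used. As a minimal sketch of that selection step only, assuming binary labels (1 = job failed, 0 = job finished) and picking the candidate with the highest F1, the metrics can be computed as follows. The labels, the candidate names `model_a` and `model_b`, and the helpers `score` and `select_best` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    accuracy: float
    recall: float
    f1: float


def score(y_true: list[int], y_pred: list[int]) -> Scores:
    """Score binary predictions where 1 = job failed and 0 = job finished."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # successes confirmed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # failures missed
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(accuracy, recall, f1)


def select_best(
    y_true: list[int], predictions: dict[str, list[int]]
) -> tuple[str, Scores]:
    """Return the candidate with the highest F1 (an assumed selection rule)."""
    scored = {name: score(y_true, y_pred) for name, y_pred in predictions.items()}
    best = max(scored, key=lambda name: scored[name].f1)
    return best, scored[best]


# Made-up labels for eight jobs; real traces would supply these.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
candidates = {
    "model_a": [1, 0, 0, 0, 1, 0, 1, 0],
    "model_b": [1, 0, 0, 1, 1, 0, 1, 0],
}
best_name, best_scores = select_best(y_true, candidates)
print(best_name, best_scores)
```

Recall is tracked alongside accuracy because failed jobs are often a minority class in cluster traces, so a model can score high accuracy while missing most failures.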




Source journal

High-Confidence Computing

CiteScore

4.70

Self-citation rate

0.00%

Number of publications