Hunting Killer Tasks for Cloud System through Machine Learning: A Google Cluster Case Study

Proceedings of the IEEE International Conference on Software Quality, Reliability and Security, pp. 1-12. Pub Date: 2016-08-01. DOI: 10.1109/QRS.2016.11

Hongyan Tang, Ying Li, Tong Jia, Zhonghai Wu


Citations: 2

Abstract

Motivated by frequent failures in cloud computing systems, we analyze the failure frequency and failure continuity of tasks in a Google cloud cluster and identify what we call killer tasks: tasks that suffer from frequent failures and repeated rescheduling. Killer tasks waste resources unnecessarily and significantly increase the scheduling workload, which can be a major concern in cloud systems. We aim to recognize killer tasks at a very early stage of their occurrence so that they can be addressed proactively instead of being rescheduled repeatedly, thereby improving reliability and saving resources. Recognizing killer tasks among a large number of tasks in real time is challenging. In this paper, we first investigate the characteristics and behavior patterns of killer tasks and then develop two machine-learning-based methods, K-HUNTER and C-HUNTER, for online recognition of killer tasks. Empirical results show that our approach achieves 97% precision in recognizing killer tasks, with an 89% timing advance and 88% resource savings for the cloud system on average.



