When and How to Retrain Machine Learning-based Cloud Management Systems
Lidia Kidane, P. Townend, Thijs Metsch, E. Elmroth
2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2022
DOI: 10.1109/IPDPSW55747.2022.00120
Citations: 1
Abstract
Cloud management systems increasingly rely on machine learning (ML) models to predict incoming workload rates, load, and other system behaviours for efficient dynamic resource management. Current state-of-the-art prediction models demonstrate high accuracy but assume that data patterns remain stable. In production use, however, systems may face hardware upgrades, changes in user behaviour, and other shifts that lead to concept drift: significant changes in the characteristics of data streams over time. To mitigate the resulting deterioration in prediction quality, ML models need to be updated, but the questions of when and how best to retrain these models remain unsolved in the context of cloud management. We present a pilot study that addresses these questions for one of the most common models for adaptive prediction, Long Short-Term Memory (LSTM), using synthetic and real-world workload data. Our analysis of when to retrain explores approaches for detecting when retraining is required, using both concept drift detection and prediction error thresholds, and at what point retraining should actually take place. Our analysis of how to retrain focuses on the data required for retraining, and on what proportion of it should be taken from before and after the need for retraining is detected. We present initial results indicating that retraining existing models can achieve prediction accuracy close to that of newly trained models at a much lower cost, and offer initial advice on how to provide cloud management systems with support for automatic retraining of ML-based models.
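The two questions the abstract raises can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a hypothetical model wrapper with `predict`/`fit` methods, and the error threshold, window sizes, and `pre_drift_fraction` value are placeholders rather than values from the paper. It only shows one way a prediction-error-threshold trigger ("when to retrain") and a retraining set mixed from pre- and post-drift data ("how to retrain") could be wired together.

```python
# Minimal sketch of an error-threshold retraining trigger plus a mixed
# pre-/post-drift retraining set. All names and parameter values here are
# illustrative assumptions, not taken from the paper.

from collections import deque

import numpy as np


def should_retrain(recent_errors, error_threshold=0.15):
    """Flag retraining when the rolling mean absolute prediction error exceeds
    a fixed threshold (one of the 'when to retrain' signals the abstract
    mentions, alongside dedicated concept-drift detectors)."""
    return float(np.mean(recent_errors)) > error_threshold


def build_retraining_set(pre_drift, post_drift, pre_drift_fraction=0.3):
    """Mix samples observed before and after the trigger fired. The share of
    pre-drift history to keep is the 'how to retrain' question the paper
    studies; 0.3 is an arbitrary placeholder."""
    n_pre = int(len(pre_drift) * pre_drift_fraction)
    pre_part = list(pre_drift)[-n_pre:] if n_pre > 0 else []
    return pre_part + list(post_drift)


history = deque(maxlen=1000)   # recent (features, target) observations
errors = deque(maxlen=50)      # rolling window of absolute prediction errors


def on_new_observation(model, x, y_true):
    """Streaming step: score the incoming sample, track its error, and retrain
    the existing model (rather than training from scratch) once the error
    window crosses the threshold."""
    y_pred = model.predict(x)              # placeholder LSTM forecast call
    errors.append(abs(y_pred - y_true))
    history.append((x, y_true))
    if len(errors) == errors.maxlen and should_retrain(errors):
        split = max(len(history) - errors.maxlen, 0)
        pre, post = list(history)[:split], list(history)[split:]
        model.fit(build_retraining_set(pre, post))   # placeholder retrain call
        errors.clear()
```

In this sketch the drift point is approximated as the start of the error window that tripped the threshold; a concept-drift detector could replace or complement that heuristic, which is one of the comparisons the paper's "when to retrain" analysis explores.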