When and How to Retrain Machine Learning-based Cloud Management Systems
Lidia Kidane, P. Townend, Thijs Metsch, E. Elmroth
2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2022
DOI: 10.1109/IPDPSW55747.2022.00120
Citations: 1
Abstract
Cloud management systems increasingly rely on machine learning (ML) models to predict incoming workload rates, load, and other system behaviours for efficient dynamic resource management. Current state-of-the-art prediction models demonstrate high accuracy but assume that data patterns remain stable. In production use, however, systems may face hardware upgrades, changes in user behaviour, and other shifts that lead to concept drift: significant changes in the characteristics of data streams over time. To mitigate the resulting deterioration in prediction quality, ML models need to be updated, but the questions of when and how best to retrain these models remain unsolved in the context of cloud management. We present a pilot study that addresses these questions for one of the most common models for adaptive prediction, Long Short-Term Memory (LSTM), using synthetic and real-world workload data. Our analysis of when to retrain explores approaches for detecting when retraining is required, using both concept drift detection and prediction error thresholds, and at what point retraining should actually take place. Our analysis of how to retrain focuses on the data required for retraining, and on what proportion of it should be taken from before and after the need for retraining is detected. We present initial results indicating that retraining existing models can achieve prediction accuracy close to that of newly trained models at a much lower cost, and offer initial advice on how to provide cloud management systems with support for automatic retraining of ML-based models.
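The two questions the abstract raises can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a hypothetical model wrapper with `predict`/`fit` methods, and the error threshold, window sizes, and `pre_drift_fraction` value are placeholders rather than values from the paper. It only shows one way a prediction-error-threshold trigger ("when to retrain") and a retraining set mixed from pre- and post-drift data ("how to retrain") could be wired together.

```python
# Minimal sketch of an error-threshold retraining trigger plus a mixed
# pre-/post-drift retraining set. All names and parameter values here are
# illustrative assumptions, not taken from the paper.

from collections import deque

import numpy as np


def should_retrain(recent_errors, error_threshold=0.15):
    """Flag retraining when the rolling mean absolute prediction error exceeds
    a fixed threshold (one of the 'when to retrain' signals the abstract
    mentions, alongside dedicated concept-drift detectors)."""
    return float(np.mean(recent_errors)) > error_threshold


def build_retraining_set(pre_drift, post_drift, pre_drift_fraction=0.3):
    """Mix samples observed before and after the trigger fired. The share of
    pre-drift history to keep is the 'how to retrain' question the paper
    studies; 0.3 is an arbitrary placeholder."""
    n_pre = int(len(pre_drift) * pre_drift_fraction)
    pre_part = list(pre_drift)[-n_pre:] if n_pre > 0 else []
    return pre_part + list(post_drift)


history = deque(maxlen=1000)   # recent (features, target) observations
errors = deque(maxlen=50)      # rolling window of absolute prediction errors


def on_new_observation(model, x, y_true):
    """Streaming step: score the incoming sample, track its error, and retrain
    the existing model (rather than training from scratch) once the error
    window crosses the threshold."""
    y_pred = model.predict(x)              # placeholder LSTM forecast call
    errors.append(abs(y_pred - y_true))
    history.append((x, y_true))
    if len(errors) == errors.maxlen and should_retrain(errors):
        split = max(len(history) - errors.maxlen, 0)
        pre, post = list(history)[:split], list(history)[split:]
        model.fit(build_retraining_set(pre, post))   # placeholder retrain call
        errors.clear()
```

In this sketch the drift point is approximated as the start of the error window that tripped the threshold; a concept-drift detector could replace or complement that heuristic, which is one of the comparisons the paper's "when to retrain" analysis explores.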