OptimML: Joint Control of Inference Latency and Server Power Consumption for ML Performance Optimization
Authors: Guoyu Chen, Xiaorui Wang
Journal: ACM Transactions on Autonomous and Adaptive Systems
Publication date: 2024-05-07
Publication type: Journal Article
DOI: 10.1145/3661825 (https://doi.org/10.1145/3661825)
Citations: 0
Abstract
Power capping is an important technique for high-density servers to safely oversubscribe the power infrastructure in a data center. However, power capping is commonly accomplished by dynamically lowering the server processors’ frequency levels, which can degrade application performance. For servers that run important machine learning (ML) applications with Service-Level Objective (SLO) requirements, inference performance such as recognition accuracy must be optimized within a certain latency constraint, which demands high server performance. To achieve the best inference accuracy under the desired latency and server power constraints, this paper proposes OptimML, a multi-input-multi-output (MIMO) control framework that jointly controls both inference latency and server power consumption by flexibly adjusting the machine learning model size (and thus its required computing resources) when server frequency needs to be lowered for power capping. Results on a hardware testbed with widely adopted ML frameworks (including PyTorch, TensorFlow, and MXNet) show that OptimML achieves higher inference accuracy than several well-designed baselines while respecting both latency and power constraints. Furthermore, an adaptive control scheme with online model switching and estimation is designed to achieve analytic assurance of control accuracy and system stability, even in the face of significant workload/hardware variations.
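To make the trade-off described in the abstract concrete, the following is a minimal sketch, not the paper's MIMO controller: it replaces feedback control with an exhaustive search over a small table of settings. The model variants, frequency levels, and the power and latency formulas are all hypothetical assumptions chosen for illustration; none of them come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelVariant:
    name: str
    gflops: float    # hypothetical compute cost per inference
    accuracy: float  # hypothetical inference accuracy


# Hypothetical model variants, ordered from smallest to largest.
VARIANTS = [
    ModelVariant("small", 1.0, 0.70),
    ModelVariant("medium", 2.0, 0.76),
    ModelVariant("large", 4.0, 0.80),
]

# Hypothetical processor frequency levels in GHz, ascending.
FREQS_GHZ = [1.2, 1.6, 2.0, 2.4]


def power_watts(freq_ghz: float) -> float:
    """Toy power model (assumption): idle power plus a cubic frequency term."""
    return 60.0 + 5.0 * freq_ghz ** 3


def latency_ms(variant: ModelVariant, freq_ghz: float) -> float:
    """Toy latency model (assumption): latency scales as work / frequency."""
    return 40.0 * variant.gflops / freq_ghz


def choose_setpoint(
    power_cap_w: float, latency_slo_ms: float
) -> Optional[Tuple[float, ModelVariant]]:
    """Pick the most accurate (frequency, variant) pair meeting both limits.

    Ties in accuracy go to the lower frequency, which draws less power.
    Returns None when no frequency level fits under the power cap.
    """
    best: Optional[Tuple[float, ModelVariant]] = None
    for freq in FREQS_GHZ:
        if power_watts(freq) > power_cap_w:
            continue  # this frequency would violate the power cap
        for variant in VARIANTS:
            if latency_ms(variant, freq) > latency_slo_ms:
                continue  # this model is too slow at this frequency
            if best is None or variant.accuracy > best[1].accuracy:
                best = (freq, variant)
    return best


if __name__ == "__main__":
    for cap in (130.0, 105.0, 65.0):
        print(cap, choose_setpoint(cap, latency_slo_ms=70.0))
```

Under these toy models, tightening the power cap from 130 W to 105 W rules out the highest frequency, so the largest variant no longer meets the 70 ms latency limit and the search falls back to a smaller model. A real controller would measure power and latency online and adjust both knobs through feedback, as the abstract describes, rather than rely on fixed formulas.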
About the journal:
TAAS addresses research on autonomous and adaptive systems being undertaken by an increasingly interdisciplinary research community, and provides a common platform under which this work can be published and disseminated. TAAS encourages contributions aimed at supporting the understanding, development, and control of such systems and of their behaviors. Contributions are expected to be based on sound and innovative theoretical models, algorithms, engineering and programming techniques, infrastructures and systems, or technological and application experiences.