NPU-Accelerated Imitation Learning for Thermal Optimization of QoS-Constrained Heterogeneous Multi-Cores

IF 2.2 · CAS Tier 4 (Computer Science) · JCR Q3 (Computer Science, Hardware & Architecture)
Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel
{"title":"基于npu加速的异构多核热优化模拟学习","authors":"Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel","doi":"10.1145/3626320","DOIUrl":null,"url":null,"abstract":"Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly not available (such as power) or fails to consider all the dimensions of the optimization (e.g., by using simplified analytical models). To solve this, machine learning (ML) methods can be employed. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training neural network (NN) at design time and accelerate the run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they are so far only used to accelerate user applications. In contrast, we use for the first time an existing accelerator on a real platform to accelerate NN-based resource management. To show the superiority of IL compared to reinforcement learning (RL) in our targeted problem, we also develop multi-agent RL-based management. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and NPU shows that IL achieves significant temperature reductions at a negligible run-time overhead. We compare TOP-IL against several techniques. Compared to ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C at minimal QoS violations for both techniques. Compared to the RL policy, our TOP-IL achieves 63 % to 89 % fewer QoS violations while resulting similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique also generalizes to different software (unseen applications) and even hardware (different cooling) than used for training.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":2.2000,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"NPU-Accelerated Imitation Learningfor Thermal Optimizationof QoS-Constrained Heterogeneous Multi-Cores\",\"authors\":\"Martin Rapp, Heba Khdr, Nikita Krohmer, Jörg Henkel\",\"doi\":\"10.1145/3626320\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). 
However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly not available (such as power) or fails to consider all the dimensions of the optimization (e.g., by using simplified analytical models). To solve this, machine learning (ML) methods can be employed. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training neural network (NN) at design time and accelerate the run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they are so far only used to accelerate user applications. In contrast, we use for the first time an existing accelerator on a real platform to accelerate NN-based resource management. To show the superiority of IL compared to reinforcement learning (RL) in our targeted problem, we also develop multi-agent RL-based management. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and NPU shows that IL achieves significant temperature reductions at a negligible run-time overhead. We compare TOP-IL against several techniques. Compared to ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C at minimal QoS violations for both techniques. Compared to the RL policy, our TOP-IL achieves 63 % to 89 % fewer QoS violations while resulting similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique also generalizes to different software (unseen applications) and even hardware (different cooling) than used for training.\",\"PeriodicalId\":50944,\"journal\":{\"name\":\"ACM Transactions on Design Automation of Electronic Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.2000,\"publicationDate\":\"2023-10-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Design Automation of Electronic Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3626320\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Design Automation of Electronic Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3626320","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets requires application migration and dynamic voltage and frequency scaling (DVFS). However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) per-cluster DVFS requires a global optimization considering all running applications. State-of-the-art resource management for power or temperature minimization either relies on measurements that are commonly not available (such as power) or fails to consider all the dimensions of the optimization (e.g., by using simplified analytical models). To solve this, machine learning (ML) methods can be employed. In particular, imitation learning (IL) leverages the optimality of an oracle policy, yet at low run-time overhead, by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by training a neural network (NN) at design time and accelerate the run-time NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread, they have so far only been used to accelerate user applications. In contrast, we use for the first time an existing accelerator on a real platform to accelerate NN-based resource management. To show the superiority of IL compared to reinforcement learning (RL) in our targeted problem, we also develop multi-agent RL-based management. Our evaluation on a HiKey 970 board with an Arm big.LITTLE CPU and NPU shows that IL achieves significant temperature reductions at a negligible run-time overhead. We compare TOP-IL against several techniques. Compared to the ondemand Linux governor, TOP-IL reduces the average temperature by up to 17 °C with minimal QoS violations for both techniques. Compared to the RL policy, our TOP-IL achieves 63 % to 89 % fewer QoS violations while resulting in similar average temperatures. Moreover, TOP-IL outperforms the RL policy in terms of stability. We additionally show that our IL-based technique also generalizes to different software (unseen applications) and even hardware (different cooling) than used for training.
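To illustrate the imitation-learning idea described in the abstract, the following is a minimal sketch of how a policy network could be trained offline on oracle demonstrations that map a system state to a core assignment and a per-cluster V/f level. It is not the authors' implementation: the class and variable names (PolicyNet, STATE_DIM, NUM_VF_LEVELS, etc.), the feature layout, the network size, and the action encoding are all illustrative assumptions.

```python
# Illustrative sketch only: supervised imitation learning in PyTorch on oracle
# (state -> action) demonstrations. Dimensions and features are assumptions.
import torch
import torch.nn as nn

STATE_DIM = 12       # e.g., per-core temperatures, utilizations, QoS slack (assumed)
NUM_CORES = 8        # big.LITTLE configuration: 4 big + 4 LITTLE cores
NUM_VF_LEVELS = 5    # assumed number of selectable per-cluster V/f levels


class PolicyNet(nn.Module):
    """Shared trunk with two heads: core selection and V/f level selection."""

    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.core_head = nn.Linear(64, NUM_CORES)
        self.vf_head = nn.Linear(64, NUM_VF_LEVELS)

    def forward(self, state):
        h = self.trunk(state)
        return self.core_head(h), self.vf_head(h)


def train_on_demonstrations(states, oracle_cores, oracle_vf, epochs=50):
    """Fit the policy to the oracle's (state, action) pairs by supervised learning."""
    net = PolicyNet()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        core_logits, vf_logits = net(states)
        loss = loss_fn(core_logits, oracle_cores) + loss_fn(vf_logits, oracle_vf)
        loss.backward()
        opt.step()
    return net


if __name__ == "__main__":
    # Random stand-in data; real demonstrations would come from an offline oracle policy.
    n = 256
    states = torch.randn(n, STATE_DIM)
    oracle_cores = torch.randint(0, NUM_CORES, (n,))
    oracle_vf = torch.randint(0, NUM_VF_LEVELS, (n,))
    policy = train_on_demonstrations(states, oracle_cores, oracle_vf)
    core_logits, vf_logits = policy(states[:1])
    print("chosen core:", core_logits.argmax().item(),
          "V/f level:", vf_logits.argmax().item())
```

Because the model is trained purely offline against oracle labels, the run-time cost reduces to a single forward pass, which is what makes offloading inference to an on-chip NPU attractive; the RL alternative mentioned in the abstract would instead have to learn from reward feedback at run time.
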
Source Journal
ACM Transactions on Design Automation of Electronic Systems
Category: Engineering & Technology - Computer Science: Software Engineering
CiteScore: 3.20
Self-citation rate: 7.10%
Articles published: 105
Review time: 3 months
Journal description: TODAES is a premier ACM journal in the design and automation of electronic systems. It publishes innovative work documenting significant research and development advances in the specification, design, analysis, simulation, testing, and evaluation of electronic systems, emphasizing a computer science/engineering orientation. Both theoretical analysis and practical solutions are welcome.