Runtime Monitoring of ML-Based Scheduling Algorithms Toward Robust Domain-Specific SoCs

A. Alper Goksoy; Alish Kanani; Satrajit Chatterjee; Umit Ogras

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 43, no. 11, pp. 4202-4213, published 2024-11-06
DOI: 10.1109/TCAD.2024.3445815
URL: https://ieeexplore.ieee.org/document/10745816/
Journal Article; JCR Q2 (Computer Science, Hardware & Architecture), Impact Factor 2.7
Citation count: 0
Abstract
Machine learning (ML) algorithms are being rapidly adopted to perform dynamic resource management tasks in heterogeneous systems-on-chip (SoCs). For example, ML-based task schedulers can make quick, high-quality decisions at runtime. Like any ML model, these offline-trained policies depend critically on the representative power of the training data. Hence, their performance may diminish or even catastrophically fail under unknown workloads, especially new applications. This article proposes a novel framework that continuously monitors the system to detect unforeseen scenarios using a gradient-based generalization metric called coherence. The proposed framework accurately determines whether the current policy generalizes to new inputs. If not, it incrementally trains the ML scheduler to ensure the robustness of the task-scheduling decisions. The proposed framework is evaluated thoroughly with a domain-specific SoC and six real-world applications. It can detect whether the trained scheduler generalizes to the current workload with 88.75%–98.39% accuracy. Furthermore, it enables 1.1×–14× faster execution time when the scheduler is incrementally trained. Finally, overhead analysis performed on an Nvidia Jetson Xavier NX board shows that the proposed framework can run as a real-time background task.
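The abstract does not spell out how coherence is computed; one common formulation of a gradient-coherence metric measures how well per-example gradients align, i.e., the squared norm of the mean gradient relative to the mean squared norm of the individual gradients. The sketch below, a minimal illustration rather than the paper's actual implementation, shows how such a metric could flag inputs whose gradients disagree with the training distribution (the function names and the 0.5 threshold are assumptions for demonstration):

```python
import numpy as np

def coherence(grads):
    """Gradient-coherence sketch: ||mean(g)||^2 / mean(||g||^2).
    Close to 1 when per-example gradients point the same way,
    close to 0 when they conflict or cancel out."""
    g = np.asarray(grads, dtype=float)
    mean_g = g.mean(axis=0)
    mean_sq_norm = np.mean(np.sum(g * g, axis=1))
    if mean_sq_norm == 0.0:
        return 0.0
    return float(np.dot(mean_g, mean_g) / mean_sq_norm)

def policy_generalizes(train_grads, new_grad, threshold=0.5):
    """Hypothetical monitor step: append the new input's gradient to a
    sample of training gradients and check whether coherence stays high.
    A drop below the threshold would trigger incremental retraining."""
    combined = np.vstack([np.asarray(train_grads, dtype=float),
                          np.asarray(new_grad, dtype=float).reshape(1, -1)])
    return coherence(combined) >= threshold
```

For example, gradients `[[1, 0], [1, 0.1]]` yield a coherence near 1, while opposing gradients `[[1, 0], [-1, 0]]` yield 0, which under this sketch would mark the new input as out of distribution.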
Journal Introduction
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.