MLCD: Machine Learning-Based Code Version and Device Selection for Heterogeneous Systems

Authors: Kaiwen Cao; Hanchen Ye; Yihan Pang; Deming Chen
Journal: IEEE Transactions on Computers, vol. 74, no. 7, pp. 2417-2430
Published: 2025-04-08
DOI: 10.1109/TC.2025.3558606
URL: https://ieeexplore.ieee.org/document/10955449/
Citations: 0
Abstract
Heterogeneous systems with hardware accelerators are increasingly common, and various optimized implementations/algorithms exist for computation kernels. However, no single combination of code version and device (C&D) outperforms all others across all input cases, demanding a method to select the best C&D pair based on the input. We present a machine learning-based code version and device selection method, named MLCD, that uses input data characteristics to select the best C&D pair dynamically. We also apply active learning to reduce the number of samples needed to construct the model. Demonstrated on two different CPU-GPU systems, MLCD achieves near-optimal speed-up on both. Concretely, on the first system, equipped with mid-range hardware, it achieves 99.9%, 95.6%, 99.9%, and 98.6% of the optimal acceleration attainable through the ideal choice of C&D pairs in General Matrix Multiply, PageRank, N-body Simulation, and K-Motif Counting, respectively. MLCD achieves speed-ups of 2.57$\boldsymbol{\times}$, 1.58$\boldsymbol{\times}$, 2.68$\boldsymbol{\times}$, and 1.09$\boldsymbol{\times}$ over baselines without MLCD. Additionally, MLCD handles end-to-end applications, achieving up to 10% and 46% speed-up over GPU-only and CPU-only solutions on Graph Neural Networks. Furthermore, it achieves a 7.28$\boldsymbol{\times}$ average speed-up in execution latency over the state-of-the-art approach and determines suitable code versions for unseen inputs $10^{8}-10^{10}\boldsymbol{\times}$ faster.
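The core idea of the abstract, mapping input-data characteristics to the fastest code-version/device pair learned from profiled runs, can be illustrated with a minimal sketch. The feature set (problem size, density), the profiled samples, and the 1-nearest-neighbor model below are hypothetical stand-ins chosen for brevity; the paper's actual features and models may differ.

```python
import math

# Hypothetical offline profiling results: for each sampled input
# (size, density), the (code version, device) pair that ran fastest.
PROFILED = [
    ((128, 0.90), "naive-cpu"),
    ((256, 0.80), "blocked-cpu"),
    ((4096, 0.90), "tiled-gpu"),
    ((8192, 0.70), "tiled-gpu"),
    ((2048, 0.05), "sparse-cpu"),
    ((8192, 0.02), "sparse-gpu"),
]


def select_cd_pair(features, samples=PROFILED):
    """Return the C&D pair of the nearest profiled input (1-NN sketch).

    Each feature dimension is scaled by its maximum observed value so
    that problem size and density contribute comparably to the distance.
    """
    scale = [max(abs(s[0][i]) for s in samples) or 1.0
             for i in range(len(features))]

    def dist(a, b):
        return math.sqrt(sum(((a[i] - b[i]) / scale[i]) ** 2
                             for i in range(len(a))))

    return min(samples, key=lambda s: dist(features, s[0]))[1]


print(select_cd_pair((150, 0.85)))   # small dense input -> a CPU version
print(select_cd_pair((6000, 0.90)))  # large dense input -> a GPU version
```

At runtime, the selector inspects the incoming input's features and dispatches to the predicted pair; the paper's active-learning step would additionally choose which profiling samples to run so that far fewer than an exhaustive sweep are needed to build such a model.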
Journal Introduction:
The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.