MLCD: Machine Learning-Based Code Version and Device Selection for Heterogeneous Systems

IF 3.8 · CAS Zone 2 (Computer Science) · JCR Q2 (Computer Science, Hardware & Architecture)
Kaiwen Cao;Hanchen Ye;Yihan Pang;Deming Chen
DOI: 10.1109/TC.2025.3558606
Journal: IEEE Transactions on Computers, vol. 74, no. 7, pp. 2417-2430
Published: 2025-04-08
URL: https://ieeexplore.ieee.org/document/10955449/
Citations: 0

Abstract

Heterogeneous systems with hardware accelerators are increasingly common, and various optimized implementations/algorithms exist for computation kernels. However, no single best combination of code version and device (C&D) can outperform others across all input cases, demanding a method to select the best C&D pair based on the input. We present a machine learning-based code version and device selection method, named MLCD, that uses input data characteristics to select the best C&D pair dynamically. We also apply active learning to reduce the number of samples needed to construct the model. Demonstrated on two different CPU-GPU systems, MLCD achieves near-optimal speed-up regardless of which system is tested. Concretely, reporting results from system one with mid-range hardware, it achieves 99.9%, 95.6%, 99.9%, and 98.6% of the optimal acceleration attainable through the ideal choice of C&D pairs in General Matrix Multiply, PageRank, N-body Simulation, and K-Motif Counting, respectively. MLCD achieves speed-ups of 2.57×, 1.58×, 2.68×, and 1.09× compared to baselines without MLCD. Additionally, MLCD handles end-to-end applications, achieving up to 10% and 46% speed-up over GPU-only and CPU-only solutions with Graph Neural Networks. Furthermore, it achieves a 7.28× average speed-up in execution latency over the state-of-the-art approach and determines suitable code versions for unseen inputs 10^8-10^10× faster.
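The paper's model code is not reproduced on this page; as a purely illustrative sketch of the core idea (selecting a code-version/device pair from input characteristics), the snippet below uses a simple nearest-neighbour rule over profiled training samples. All feature names, kernel names, and the 1-NN rule itself are hypothetical and are not taken from the MLCD paper, which trains a more sophisticated model with active learning.

```python
# Hypothetical sketch: pick a (code version, device) pair from input features.
# Neither the feature set nor the 1-NN rule comes from the MLCD paper itself.
import math

# Offline phase: profiled samples mapping input features to the best-measured
# C&D pair. Features here are (problem size, density); labels are made up.
TRAINING_SAMPLES = [
    ((64,   1.00), ("naive_gemm",  "CPU")),
    ((2048, 1.00), ("tiled_gemm",  "GPU")),
    ((4096, 0.01), ("sparse_gemm", "CPU")),
    ((8192, 0.90), ("tiled_gemm",  "GPU")),
]

def select_cd_pair(features):
    """Online phase: return the C&D pair of the nearest profiled sample."""
    def dist(a, b):
        # Log-scale the size feature so that 64 vs. 8192 compares sensibly
        # against the [0, 1] density feature.
        return math.hypot(math.log2(a[0]) - math.log2(b[0]), a[1] - b[1])
    _, best_pair = min(TRAINING_SAMPLES, key=lambda s: dist(s[0], features))
    return best_pair

if __name__ == "__main__":
    print(select_cd_pair((128, 0.95)))   # small dense input  -> naive_gemm/CPU
    print(select_cd_pair((4000, 0.02)))  # large sparse input -> sparse_gemm/CPU
```

The per-input decision is a cheap model lookup rather than a profiling run, which is the property that lets such a selector sit on the critical path; the paper's reported 10^8-10^10× faster version determination for unseen inputs reflects that same lookup-vs-rerun distinction.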
Source Journal
IEEE Transactions on Computers
Category: Engineering Technology - Electrical & Electronic Engineering
CiteScore: 6.60
Self-citation rate: 5.40%
Articles per year: 199
Review time: 6.0 months
Journal description: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.