IEEE Transactions on Computers: Latest Articles (Page 7)

Dual Fast-Track Cache: Organizing Ring-Shaped Racetracks to Work as L1 Caches

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-06-03 DOI: 10.1109/TC.2025.3575909

Alejandro Valero;Vicente Lorente;Salvador Petit;Julio Sahuquillo

Volume 74, Issue 8, pp. 2812-2826. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11022726

Abstract: Static Random-Access Memory (SRAM) is the fastest memory technology and has been the common design choice for implementing first-level (L1) caches in the processor pipeline, where speed is a key design issue that must be fulfilled. On the contrary, this technology offers much lower density compared to other technologies like Dynamic RAM, limiting L1 cache sizes of modern processors to a few tens of KB. This paper explores the use of slower but denser Domain Wall Memory (DWM) technology for L1 caches. This technology provides slow access times since it arranges multiple bits sequentially in a magnetic racetrack. To access these bits, they need to be shifted in order to place them under a header. A 1-bit shift usually takes one processor cycle, which can significantly hurt the application performance, making this working behavior inappropriate for L1 caches. Based on the locality (temporal and spatial) principles exploited by caches, this work proposes the Dual Fast-Track Cache (Dual FTC) design, a new approach to organizing a set of racetracks to build set-associative caches. Compared to a conventional SRAM cache, Dual FTC enhances storage capacity by 5× while incurring minimal shifting overhead, thereby rendering it a practical and appealing solution for L1 cache implementations. Experimental results show that the devised cache organization is as fast as an SRAM cache for 78% and 86% of the L1 data cache hits and L1 instruction cache hits, respectively (i.e., no shift is required). Consequently, due to the larger L1 cache capacities, significant system performance gains (by 22% on average) are obtained under the same silicon area.
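
The shifting cost described in this abstract can be illustrated with a toy model. The sketch below assumes a ring-shaped racetrack with a single access head, bidirectional shifting, and one cycle per 1-bit shift. The class name `RingRacetrack` and its methods are hypothetical and are not taken from the Dual FTC design itself.

```python
class RingRacetrack:
    """Toy model of a ring-shaped racetrack with one access head (assumed layout)."""

    def __init__(self, num_bits: int) -> None:
        self.num_bits = num_bits
        self.head = 0  # index of the bit currently aligned with the head

    def shifts_needed(self, target: int) -> int:
        """Minimum number of 1-bit shifts to align `target` with the head on a ring."""
        forward = (target - self.head) % self.num_bits
        return min(forward, self.num_bits - forward)

    def access(self, target: int) -> int:
        """Align `target` with the head and return the extra cycles spent shifting."""
        cycles = self.shifts_needed(target)  # assumption: one shift costs one cycle
        self.head = target
        return cycles


track = RingRacetrack(num_bits=8)
costs = [track.access(i) for i in (0, 0, 1, 7, 3)]
print(costs)
```

In this model, a repeated access to the same position costs zero shifts, which corresponds to the "no shift is required" hits mentioned in the abstract, while a jump to the opposite side of the ring costs the most.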

Citations: 0

Ls-Stream: Lightening Stragglers in Join Operators for Skewed Data Stream Processing

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-06-03 DOI: 10.1109/TC.2025.3575917

Minghui Wu;Dawei Sun;Shang Gao;Keqin Li;Rajkumar Buyya

Volume 74, Issue 8, pp. 2841-2855

Abstract: Load imbalance can lead to the emergence of stragglers, i.e., join instances that significantly lag behind others in processing data streams. Currently, state-of-the-art solutions are capable of balancing the load between join instances to mitigate stragglers by managing hot keys and random partitioning. However, these solutions rely on either complicated routing strategies or resource-inefficient processing structures, making them susceptible to frequent changes in load between instances. Therefore, we present Ls-Stream, a data stream scheduler that aims to support dynamic workload assignment for join instances to lighten stragglers. This paper outlines our solution from the following aspects: (1) The models for partitioning, communication, matrix, and resource are developed, formalizing problems like imbalanced load between join instances and state migration costs. (2) Ls-Stream employs a two-level routing strategy for workload allocation by combining hash-based and key-based data partitioning, specifying the destination join instances for data tuples. (3) Ls-Stream also constructs a fine-grained model for minimizing the state migration cost. This allows us to make trade-offs between data transfer overhead and migration benefits. (4) Experimental results demonstrate significant improvements made by Ls-Stream: reducing maximum system latency by 49.3% and increasing maximum throughput by more than 2x compared to existing state-of-the-art works.
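
The two-level routing idea in item (2) of this abstract can be sketched as follows. This is a minimal illustration, assuming that keys identified as hot are pinned to explicit destinations through a key-based table and that all other keys fall back to hash partitioning. The function name `route`, the `hot_key_table` parameter, the ordering of the two levels, and the choice of CRC32 as the hash are all assumptions, not details from the paper.

```python
import zlib


def route(key: str, num_instances: int, hot_key_table: dict[str, int]) -> int:
    """Return the index of the join instance that should receive `key`.

    Level 1 (key-based): keys listed in `hot_key_table` go to their assigned instance.
    Level 2 (hash-based): every other key is hashed onto the available instances.
    """
    if key in hot_key_table:
        return hot_key_table[key]
    # zlib.crc32 is deterministic across runs, unlike Python's built-in hash() for str
    return zlib.crc32(key.encode("utf-8")) % num_instances


hot = {"user_42": 3}  # hypothetical hot key pinned to instance 3
dest_hot = route("user_42", 4, hot)
dest_cold = route("user_7", 4, hot)
```

Rebalancing a straggler in this sketch amounts to editing `hot_key_table`, which leaves the hash-based path for the remaining keys untouched.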

Citations: 0

Fast Garbage Collection in Erasure-Coded Storage Clusters

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-06-03 DOI: 10.1109/TC.2025.3575914

Hai Zhou;Dan Feng;Yuchong Hu;Wei Wang;Huadong Huang

Volume 74, Issue 8, pp. 2827-2840

Abstract: Erasure codes (EC) have been widely adopted to provide high data reliability with low storage costs in clusters. Due to the deletion and out-of-place update operations, some data blocks are invalid, which unfortunately arouses the tedious garbage collection (GC) problem. Several limitations still plague existing designs: substantial network traffic, unbalanced traffic load, and low read/write performance after GC. This paper proposes FastGC, a fast garbage collection method that merges the old stripes into a new stripe and reclaims invalid blocks. FastGC quickly generates an efficient merge solution by stripe grouping and bit sequences operations to minimize network traffic and maintains data block distributions of the same stripe to ensure read performance. It carefully allocates the storage space for new stripes during merging to eliminate the discontinuous free spaces that affect write performance. Furthermore, to accelerate the parity updates after merging, FastGC greedily schedules the transmission links for multi-stripe updates to balance the traffic load across nodes and adopts a maximum flow algorithm to saturate the bandwidth utilization. Comprehensive evaluation results show via simulations and Alibaba ECS experiments that FastGC can significantly reduce 10.36%-81.22% of the network traffic and 34.25%-72.36% of the GC time while maintaining read/write performance after GC.
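
One possible reading of merging stripes with bit-sequence operations is sketched below. It assumes each old stripe is summarized by a bitmask whose set bits mark positions that still hold valid data blocks, and that two stripes can be merged without relocating blocks when their masks do not overlap. The function `greedy_merge` and this first-fit grouping are illustrative assumptions and are not FastGC's actual algorithm.

```python
def greedy_merge(stripes: list[int]) -> list[int]:
    """Greedily pack stripe bitmasks into merged stripes (first-fit, assumed policy).

    Each integer is a bitmask: bit i set means position i holds a valid data block.
    A stripe joins the first merged stripe whose valid positions do not overlap its own.
    """
    merged: list[int] = []
    for mask in stripes:
        for i, existing in enumerate(merged):
            if (existing & mask) == 0:      # no position is valid in both stripes
                merged[i] = existing | mask  # union of the valid positions
                break
        else:
            merged.append(mask)              # no compatible stripe: start a new one
    return merged


# Four half-empty stripes over 4 block positions
old_stripes = [0b0011, 0b1100, 0b0101, 0b1010]
new_stripes = greedy_merge(old_stripes)
print([bin(m) for m in new_stripes])
```

Here four half-empty stripes collapse into two full ones, and every valid block keeps its original position, which is the property the abstract associates with preserving read performance.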

Citations: 0

DCAS-BMT: Dynamic Construction and Adjustment of Skewed Bonsai Merkle Tree for Performance Enhancement in Secure Non-Volatile Memory

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-04-10 DOI: 10.1109/TC.2025.3558007

Yu Zhang;Renhai Chen;Hangyu Yan;Hongyue Wu;Zhiyong Feng

Volume 74, Issue 7, pp. 2183-2194

Abstract: Traditional DRAM-based memory solutions face challenges, including high energy consumption and limited scalability. Non-Volatile Memory (NVM) offers low energy consumption and high scalability. However, security challenges, particularly data remanence vulnerabilities, persist. Prevalent methods such as the Bonsai Merkle Tree (BMT) are employed to ensure data security. However, the consistency requirements for integrity tree updates have led to performance issues. It is observed that compared to a secure NVM system without persistent secure metadata, the average overhead for updating and persisting the BMT root with persistent secure metadata is as high as 2.48 times. Therefore, this paper aims to mitigate these inefficiencies by leveraging the principle of memory access locality. We propose the Dynamic Construction and Adjustment of Skewed Bonsai Merkle Tree (DCAS-BMT). The DCAS-BMT is dynamically built and continuously adjusted at runtime according to access weights, ensuring frequently accessed memory blocks reside on shorter paths to the root node. This reduces the verification steps for frequently accessed memory blocks, thereby lowering the overall cost of memory authentication and updates. Experimental results using the USIMM memory simulator demonstrate that compared to the widely used BMT approach, the DCAS-BMT scheme shows a performance improvement of 34.1%.
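
The effect of skewing a tree by access weight can be illustrated with a Huffman-style construction, used here as a stand-in technique rather than the paper's algorithm. The sketch assumes a binary tree for simplicity (integrity trees commonly use a higher arity) and static weights; the function name `skewed_depths` and the block labels are hypothetical.

```python
import heapq
from itertools import count


def skewed_depths(weights: dict[str, int]) -> dict[str, int]:
    """Depth of each leaf in a Huffman-style binary tree built from access weights."""
    tie = count()  # tie-breaker so the heap never has to compare the leaf lists
    heap = [(w, next(tie), [block]) for block, w in weights.items()]
    heapq.heapify(heap)
    depth = {block: 0 for block in weights}
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)  # two lightest subtrees
        w2, _, leaves2 = heapq.heappop(heap)
        for block in leaves1 + leaves2:
            depth[block] += 1  # merged subtrees sit one level further from the root
        heapq.heappush(heap, (w1 + w2, next(tie), leaves1 + leaves2))
    return depth


access_weights = {"hot": 8, "warm": 4, "cold_a": 1, "cold_b": 1}
depths = skewed_depths(access_weights)
weighted_steps = sum(access_weights[b] * depths[b] for b in depths)
print(depths, weighted_steps)
```

For these weights the hot block sits one step from the root and the weighted verification cost is 22 steps, compared with 28 for a balanced binary tree in which all four leaves are at depth 2.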

Citations: 0

DCGG: A Dynamically Adaptive and Hardware-Software Coordinated Runtime System for GNN Acceleration on GPUs

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-04-09 DOI: 10.1109/TC.2025.3558042

Guoqing Xiao;Li Xia;Yuedan Chen;Hongyang Chen;Wangdong Yang

Volume 74, Issue 7, pp. 2293-2305

Abstract: Graph neural networks (GNNs) are a prominent trend in graph-based deep learning, known for their capacity to produce high-quality node embeddings. However, the existing GNN framework design is only implemented from the algorithm level, and the hardware architecture of the GPU is not fully utilized. To this end, we propose DCGG, a dynamic runtime adaptive framework, which can accelerate various GNN workloads on GPU platforms. DCGG has carried out deeper optimization work mainly in terms of load balancing and software and hardware matching. Accordingly, three optimization strategies are proposed. First, we propose dynamic 2D workload management methods and perform customized optimization based on it, effectively reducing additional memory operations. Second, a new slicing strategy is adopted, combined with hardware features, to effectively improve the efficiency of data reuse. Third, DCGG uses the Quantitative Dimension Parallel Strategy to optimize dimensions and parallel methods, greatly improving load balance and data locality. Extensive experiments demonstrate that DCGG outperforms the state-of-the-art GNN computing frameworks, such as Deep Graph Library (up to 3.10× faster) and GNNAdvisor (up to 2.80× faster), on mainstream GNN architectures across various datasets.

Citations: 0

SmartZone: Runtime Support for Secure and Efficient On-Device Inference on ARM TrustZone

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-04-08 DOI: 10.1109/TC.2025.3557971

Zhaolong Jian;Xu Liu;Qiankun Dong;Longkai Cheng;Xueshuo Xie;Tao Li

Volume 74, Issue 6, pp. 2144-2158

Abstract: On-device inference is a burgeoning paradigm that performs model inference locally on end devices, allowing private data to remain local. ARM TrustZone as a widely supported trusted execution environment has been applied to provide confidentiality protection for on-device inference. However, with the rise of large-scale models like large language models (LLMs), TrustZone-based on-device inference faces challenges in migration difficulties and inefficient execution. The rudimentary TEE OS on TrustZone lacks both the inference runtime needed for building models and the parallel support necessary to accelerate inference. Moreover, the limited secure memory resources on end devices further constrain the model size and degrade performance. In this paper, we propose SmartZone to provide runtime support for secure and efficient on-device inference on TrustZone. SmartZone consists of three main components: (1) a trusted inference-oriented operator set, providing the underlying mechanisms adapted to the TrustZone's execution mode for trusted inference of DNN models and LLMs, (2) the proactive multi-threading parallel support, which increases the number of CPU cores in the secure state via cross-world thread collaboration to achieve parallelism, and (3) the on-demand secure memory management method, which statically allocates the appropriate secure memory size based on pre-execution resource analysis. We implement a prototype of SmartZone on the Raspberry Pi 3B+ board and evaluate it on four well-known DNN models and llama2 LLM. Extensive experimental results show that SmartZone provides end-to-end protection for on-device inference while maintaining excellent performance. Compared to the original trusted inference, SmartZone accelerates the inference speed by up to 4.26× and reduces energy consumption by 65.81%.

Citations: 0

Accelerating RNA-Seq Quantification on a Real Processing-in-Memory System

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-04-08 DOI: 10.1109/TC.2025.3558075

Liang-Chi Chen;Chien-Chung Ho;Yuan-Hao Chang

Volume 74, Issue 7, pp. 2334-2347

Abstract: Recently, with the growth of the required data size for emerging applications (e.g., graph processing and machine learning), the von Neumann bottleneck has become a main problem for restricting the throughput of the applications. To address the problem, an acceleration technique called Processing in Memory (PIM) has garnered attention due to its potential to reduce off-chip data movement between the processing unit (e.g., CPU) and memory device (e.g., DRAM). In 2019, UPMEM introduced the commercially available processing-in-memory product, the DRAM Processing Unit (DPU) [8], showing a new chance for accelerating data-intensive applications. Among data-intensive applications, RNA sequence (RNA-seq) quantification is used to measure the abundance of RNA sequences, and it also plays a critical role in the field of bioinformatics. We aim to leverage UPMEM DPU to accelerate RNA-seq Quantification. However, due to the DPU usage limitations caused by DPU hardware, there are some challenges to realizing RNA-seq Quantification on the DPU system. To overcome these challenges, we propose UpPipe, which consists of the DPU-friendly transcriptome allocation, the DPU-aware pipeline management, and the WRAM prefetching scheme. The UpPipe considers the hardware limitations of DPUs, enabling efficient sequence alignment even within the resource-constrained DPUs. The experimental results demonstrate the feasibility and efficiency of our proposed design. We also provide an evaluation study on the impact of data granularity selection on pipeline management and the optimal size for the WRAM prefetching scheme.

Citations: 0

Accelerating Loss Recovery for Content Delivery Network

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-04-08 DOI: 10.1109/TC.2025.3558020

Tong Li;Wei Liu;Xinyu Ma;Shuaipeng Zhu;Jingkun Cao;Duling Xu;Zhaoqi Yang;Senzhen Liu;Taotao Zhang;Yinfeng Zhu;Bo Wu;Kezhi Wang;Ke Xu

Volume 74, Issue 7, pp. 2223-2237

Abstract: Packet losses significantly impact the user experience of content delivery network (CDN) services such as live streaming and data backup-and-archiving. However, our production network measurement studies show that the legacy loss recovery is far from satisfactory due to the wide-area loss characteristics (i.e., dynamics and burstiness) in the wild. In this paper, we propose a sender-side Adaptive ReTransmission scheme, ART, which minimizes the recovery time of lost packets with minimal redundancy cost. Distinguishing itself from forward-error-correction (FEC), which preemptively sends redundant data packets to prevent loss, ART functions as an automatic-repeat-request (ARQ) scheme. It applies redundancy specifically to lost packets instead of unlost packets, thereby addressing the characteristic patterns of wide-area losses in real-world scenarios. We implement ART upon QUIC protocol and evaluate it via both trace-driven emulation and real-world deployment. The results show that ART reduces up to 34% of flow completion time (FCT) for delay-sensitive transmissions, improves up to 26% of goodput for throughput-intensive transmissions, reduces 11.6% video playback rebuffering, and saves up to 90% of redundancy cost.
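
The trade-off between recovery time and redundancy cost for retransmitted packets can be illustrated with a small calculation. The sketch below assumes independent packet losses with a fixed loss rate, which is a simplification given that the abstract stresses bursty wide-area losses. The function `copies_needed` is hypothetical and is not ART's adaptive policy, which the abstract does not specify.

```python
def copies_needed(loss_rate: float, target: float) -> int:
    """Smallest number of retransmitted copies so that at least one arrives
    with probability >= `target`, assuming independent losses (simplification)."""
    if not (0.0 <= loss_rate < 1.0) or not (0.0 < target < 1.0):
        raise ValueError("loss_rate must be in [0, 1) and target in (0, 1)")
    copies = 1
    all_lost = loss_rate  # probability that every copy sent so far is lost
    while 1.0 - all_lost < target:
        copies += 1
        all_lost *= loss_rate
    return copies


light = copies_needed(loss_rate=0.2, target=0.99)
heavy = copies_needed(loss_rate=0.5, target=0.9)
print(light, heavy)
```

Because the extra copies are sent only for packets already detected as lost, the redundancy cost in this model scales with the loss rate rather than with total traffic, which is the contrast with FEC that the abstract draws.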

Citations: 0

CIMUS: 3D-Stacked Computing-in-Memory Under Image Sensor Architecture for Efficient Machine Vision

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-04-08 DOI: 10.1109/TC.2025.3558068

Lixia Han;Yiyang Chen;Siyuan Chen;Haozhang Yang;Ao Shi;Guihai Yu;Jiaqi Li;Zheng Zhou;Yijiao Wang;Yanzhi Wang;Xiaoyan Liu;Jinfeng Kang;Peng Huang

Volume 74, Issue 7, pp. 2321-2333

Abstract: Computational image sensors with CNN processing capabilities are emerging to alleviate the energy-intensive and time-consuming data movement between sensors and external processors. However, deploying CNN models onto these computational image sensors faces challenges from the limited on-chip memory resources and insufficient image processing throughput. This work proposes a 3D-stacked NAND flash-based computing-in-memory under image sensor architecture (CIMUS) to facilitate the complete deployment of CNN model. To fully leverage the potential of high bandwidth from the 3D-stacked integration, we design a novel distributed CNN mapping and dataflow to process the full focal plane image in parallel, which senses and recognizes ImageNet tasks with >1000 fps. To tackle the computational error of inputs “0” in 3D NAND flash-based CIM, we propose an input-independent offset compensation method, which reduces the average vector-matrix multiplication (VMM) error by 48%. Evaluation results indicate that CIMUS architecture achieves a 9.8× improvement in CNN inference speed and a 33× boost in energy efficiency compared to the state-of-the-art computational image sensor in the ImageNet recognition task.

Citations: 0

MLCD: Machine Learning-Based Code Version and Device Selection for Heterogeneous Systems

IF 3.6 | Zone 2, Computer Science

IEEE Transactions on Computers Pub Date : 2025-04-08 DOI: 10.1109/TC.2025.3558606

Kaiwen Cao;Hanchen Ye;Yihan Pang;Deming Chen

Volume 74, Issue 7, pp. 2417-2430

Abstract: Heterogeneous systems with hardware accelerators are increasingly common, and various optimized implementations/algorithms exist for computation kernels. However, no single best combination of code version and device (C&D) can outperform others across all input cases, demanding a method to select the best C&D pair based on input. We present a machine learning-based code version and device selection method, named MLCD, that uses input data characteristics to select the best C&D pair dynamically. We also apply active learning to reduce the number of samples needed to construct the model. Demonstrated on two different CPU-GPU systems, MLCD achieves near-optimal speed-up regardless of which system is tested. Concretely, reporting results from system one with mid-end hardware, it achieves 99.9%, 95.6%, 99.9%, and 98.6% of the optimal acceleration attainable through the ideal choice of C&D pairs in General Matrix Multiply, PageRank, N-body Simulation, and K-Motif Counting, respectively. MLCD achieves a speed-up of 2.57×, 1.58×, 2.68×, and 1.09× compared to baselines without MLCD. Additionally, MLCD handles end-to-end applications, achieving up to 10% and 46% speed-up over GPU-only and CPU-only solutions with Graph Neural Networks. Furthermore, it achieves 7.28× average speed-up in execution latency over the state-of-the-art approach and determines suitable code versions for unseen input 10^8-10^10× faster.

Citations: 0