{"title":"Serving MoE Models on Resource-Constrained Edge Devices via Dynamic Expert Swapping","authors":"Rui Kong;Yuanchun Li;Weijun Wang;Linghe Kong;Yunxin Liu","doi":"10.1109/TC.2025.3575905","DOIUrl":"https://doi.org/10.1109/TC.2025.3575905","url":null,"abstract":"Mixture of experts (MoE) is a popular technique in deep learning that improves model capacity with conditionally-activated parallel neural network modules (experts). However, serving MoE models in resource-constrained latency-critical edge scenarios is challenging due to the significantly increased model size and complexity. In this paper, we first analyze the behavior pattern of MoE models in continuous inference scenarios, which leads to three key observations about the expert activations, including temporal locality, exchangeability, and skippable computation. Based on these observations, we introduce PC-MoE, an inference framework for resource-constrained continuous MoE model serving. The core of PC-MoE is a new data structure, <i>Parameter Committee</i>, that intelligently maintains a subset of important experts in use to reduce resource consumption. To evaluate the effectiveness of PC-MoE, we conduct experiments using state-of-the-art MoE models on common computer vision and natural language processing tasks. The results demonstrate that PC-MoE achieves optimal trade-offs between resource consumption and model accuracy.
For instance, on object detection tasks with the Swin-MoE model, our approach can reduce memory usage and latency by 42.34% and 18.63% with only 0.10% accuracy degradation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2799-2811"},"PeriodicalIF":3.6,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
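The Parameter Committee idea above — keeping only a small set of important experts resident and swapping others in on demand — can be sketched as a minimal expert cache. This is an illustrative sketch: the class name, LRU eviction policy, and `load_fn` interface are assumptions, not the paper's actual design, which also weighs activation frequency and skippable computation.

```python
from collections import OrderedDict

class ParameterCommittee:
    """Illustrative committee-style expert cache: keeps the k most
    recently used experts resident and evicts the least recently
    used one when a new expert must be swapped in."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # expert_id -> weights

    def fetch(self, expert_id, load_fn):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # temporal-locality hit
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)      # evict LRU expert
        weights = load_fn(expert_id)               # swap in from slower storage
        self.resident[expert_id] = weights
        return weights
```

Only cache misses touch `load_fn`, which models the expensive expert swap-in that the temporal-locality observation makes rare.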
{"title":"Dual Fast-Track Cache: Organizing Ring-Shaped Racetracks to Work as L1 Caches","authors":"Alejandro Valero;Vicente Lorente;Salvador Petit;Julio Sahuquillo","doi":"10.1109/TC.2025.3575909","DOIUrl":"https://doi.org/10.1109/TC.2025.3575909","url":null,"abstract":"Static Random-Access Memory (SRAM) is the fastest memory technology and has been the common design choice for implementing first-level (L1) caches in the processor pipeline, where speed is a key design requirement. In contrast, this technology offers much lower density compared to other technologies like Dynamic RAM, limiting L1 cache sizes of modern processors to a few tens of KB. This paper explores the use of slower but denser Domain Wall Memory (DWM) technology for L1 caches. This technology incurs slow access times since it arranges multiple bits sequentially in a magnetic racetrack. To access these bits, they need to be shifted in order to place them under a header. A 1-bit shift usually takes one processor cycle, which can significantly hurt application performance, making this working behavior inappropriate for L1 caches. Based on the locality (temporal and spatial) principles exploited by caches, this work proposes the Dual Fast-Track Cache (Dual FTC) design, a new approach to organizing a set of racetracks to build set-associative caches. Compared to a conventional SRAM cache, Dual FTC enhances storage capacity by 5× while incurring minimal shifting overhead, thereby rendering it a practical and appealing solution for L1 cache implementations. Experimental results show that the devised cache organization is as fast as an SRAM cache for 78% and 86% of the L1 data cache hits and L1 instruction cache hits, respectively (i.e., no shift is required).
Consequently, due to the larger L1 cache capacities, significant system performance gains (by 22% on average) are obtained under the same silicon area.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2812-2826"},"PeriodicalIF":3.6,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11022726","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144598046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
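The shifting cost that motivates the design above can be modeled in a few lines. The sketch below is an assumption-laden toy (the function, head placement, and track length are illustrative, not the paper's parameters): on a ring-shaped racetrack, a bit can shift in either direction toward the nearest access port, so adding a second, diametrically opposed port halves the worst-case shift count.

```python
def shifts_needed(bit_pos, head_positions, track_len):
    """Minimal number of 1-bit shifts to align bit_pos under the
    nearest access port on a ring-shaped racetrack (shifts may wrap
    around the ring in either direction)."""
    best = track_len
    for h in head_positions:
        d = abs(bit_pos - h)
        best = min(best, d, track_len - d)  # ring allows either direction
    return best

# Worst-case shift count over all bit positions: one head vs. two heads
single = max(shifts_needed(b, [0], 8) for b in range(8))
dual = max(shifts_needed(b, [0, 4], 8) for b in range(8))
```

With one head the worst bit is half a ring away (4 shifts on an 8-bit track); with two opposed heads no bit is ever more than a quarter ring from a port (2 shifts).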
{"title":"Ls-Stream: Lightening Stragglers in Join Operators for Skewed Data Stream Processing","authors":"Minghui Wu;Dawei Sun;Shang Gao;Keqin Li;Rajkumar Buyya","doi":"10.1109/TC.2025.3575917","DOIUrl":"https://doi.org/10.1109/TC.2025.3575917","url":null,"abstract":"Load imbalance can lead to the emergence of stragglers, i.e., join instances that significantly lag behind others in processing data streams. Currently, state-of-the-art solutions are capable of balancing the load between join instances to mitigate stragglers by managing hot keys and random partitioning. However, these solutions rely on either complicated routing strategies or resource-inefficient processing structures, making them susceptible to frequent changes in load between instances. Therefore, we present Ls-Stream, a data stream scheduler that aims to support dynamic workload assignment for join instances to lighten stragglers. This paper outlines our solution from the following aspects: (1) The models for partitioning, communication, matrix, and resource are developed, formalizing problems like imbalanced load between join instances and state migration costs. (2) Ls-Stream employs a two-level routing strategy for workload allocation by combining hash-based and key-based data partitioning, specifying the destination join instances for data tuples. (3) Ls-Stream also constructs a fine-grained model for minimizing the state migration cost. This allows us to make trade-offs between data transfer overhead and migration benefits. 
(4) Experimental results demonstrate significant improvements made by Ls-Stream: reducing maximum system latency by 49.3% and increasing maximum throughput by more than 2x compared to existing state-of-the-art works.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2841-2855"},"PeriodicalIF":3.6,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
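The two-level routing strategy described in point (2) above can be sketched as follows. This is a hedged illustration of the general hot-key/hash split, not Ls-Stream's exact router; the `hot_table` name and fallback rule are assumptions.

```python
def route(key, hot_table, n_instances):
    """Two-level routing sketch for stream joins: known hot keys
    follow an explicit key-based table (so their placement can be
    rebalanced when load shifts), while all other keys fall back to
    cheap hash-based partitioning."""
    if key in hot_table:
        return hot_table[key]          # key-based level: migratable hot keys
    return hash(key) % n_instances     # hash-based level: default path

# Rebalancing a straggler is then just an entry update in hot_table,
# e.g. hot_table["user42"] = least_loaded_instance, without rehashing
# the entire key space.
```

Keeping the explicit table small is what makes the routing state cheap, while still letting the scheduler move exactly the keys that cause stragglers.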
{"title":"Fast Garbage Collection in Erasure-Coded Storage Clusters","authors":"Hai Zhou;Dan Feng;Yuchong Hu;Wei Wang;Huadong Huang","doi":"10.1109/TC.2025.3575914","DOIUrl":"https://doi.org/10.1109/TC.2025.3575914","url":null,"abstract":"<i>Erasure codes</i> (EC) have been widely adopted to provide high data reliability with low storage costs in clusters. Due to the deletion and out-of-place update operations, some data blocks are invalid, which unfortunately gives rise to the tedious <i>garbage collection</i> (GC) problem. Several limitations still plague existing designs: substantial network traffic, unbalanced traffic load, and low read/write performance after GC. This paper proposes FastGC, a fast garbage collection method that merges the old stripes into a new stripe and reclaims invalid blocks. FastGC quickly generates an efficient merge solution by stripe grouping and bit-sequence operations to minimize network traffic and maintains data block distributions of the same stripe to ensure read performance. It carefully allocates the storage space for new stripes during merging to eliminate the discontinuous free spaces that affect write performance. Furthermore, to accelerate the parity updates after merging, FastGC greedily schedules the transmission links for multi-stripe updates to balance the traffic load across nodes and adopts a maximum flow algorithm to saturate the bandwidth utilization.
Comprehensive evaluations via simulations and Alibaba ECS experiments show that FastGC reduces network traffic by 10.36%-81.22% and GC time by 34.25%-72.36% while maintaining read/write performance after GC.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 8","pages":"2827-2840"},"PeriodicalIF":3.6,"publicationDate":"2025-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144597657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
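The bit-sequence idea behind FastGC's stripe grouping can be illustrated with integer bitmasks: if each stripe's valid-block positions are encoded as a bit sequence, two stripes can be merged into one full stripe (with every surviving block staying in place, preserving read performance) exactly when their masks do not collide. The function names and the AND/OR encoding are illustrative assumptions, not the paper's exact formulation.

```python
def mergeable(valid_a, valid_b):
    """Two stripes can merge without relocating any valid block iff
    their valid-block positions do not overlap (bitwise AND is 0)."""
    return valid_a & valid_b == 0

def merge(valid_a, valid_b):
    """Union of the two stripes' valid-block positions: the merged
    stripe's layout, computable with a single bitwise OR."""
    assert mergeable(valid_a, valid_b)
    return valid_a | valid_b
```

Checking a candidate pair is one AND and one comparison, which is why a grouping pass over many stripes can find merge solutions quickly.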
{"title":"DCAS-BMT: Dynamic Construction and Adjustment of Skewed Bonsai Merkle Tree for Performance Enhancement in Secure Non-Volatile Memory","authors":"Yu Zhang;Renhai Chen;Hangyu Yan;Hongyue Wu;Zhiyong Feng","doi":"10.1109/TC.2025.3558007","DOIUrl":"https://doi.org/10.1109/TC.2025.3558007","url":null,"abstract":"Traditional DRAM-based memory solutions face challenges, including high energy consumption and limited scalability. Non-Volatile Memory (NVM) offers low energy consumption and high scalability. However, security challenges, particularly data remanence vulnerabilities, persist. Prevalent methods such as the Bonsai Merkle Tree (BMT) are employed to ensure data security. However, the consistency requirements for integrity tree updates have led to performance issues. It is observed that compared to a secure NVM system without persistent secure metadata, the average overhead for updating and persisting the BMT root with persistent secure metadata is as high as 2.48 times. Therefore, this paper aims to mitigate these inefficiencies by leveraging the principle of memory access locality. We propose the Dynamic Construction and Adjustment of Skewed Bonsai Merkle Tree (DCAS-BMT). The DCAS-BMT is dynamically built and continuously adjusted at runtime according to access weights, ensuring frequently accessed memory blocks reside on shorter paths to the root node. This reduces the verification steps for frequently accessed memory blocks, thereby lowering the overall cost of memory authentication and updates. 
Experimental results using the USIMM memory simulator demonstrate that compared to the widely used BMT approach, the DCAS-BMT scheme shows a performance improvement of 34.1%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2183-2194"},"PeriodicalIF":3.6,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
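The skew principle above — frequently accessed blocks on shorter root paths — is the same intuition as weight-based tree construction. The sketch below uses a Huffman-style merge of the lightest subtrees as a stand-in; DCAS-BMT additionally adjusts the tree dynamically at runtime, so treat the function and its static policy as an illustrative assumption only.

```python
import heapq

def skewed_depths(weights):
    """Huffman-style illustration of tree skewing: repeatedly merge
    the two lightest subtrees, so blocks with higher access weights
    end up at smaller depths (fewer verification steps to the root)."""
    heap = [(w, [name]) for name, w in weights.items()]
    heapq.heapify(heap)
    depth = {name: 0 for name in weights}
    while len(heap) > 1:
        wa, a = heapq.heappop(heap)
        wb, b = heapq.heappop(heap)
        for n in a + b:
            depth[n] += 1              # merged subtree sinks one level
        heapq.heappush(heap, (wa + wb, a + b))
    return depth

d = skewed_depths({"hot": 8, "warm": 3, "cold": 1})
```

Here the hot block sits at depth 1 while the cold blocks sit deeper, which is exactly the path-length asymmetry that lowers the average authentication cost.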
{"title":"DCGG: A Dynamically Adaptive and Hardware-Software Coordinated Runtime System for GNN Acceleration on GPUs","authors":"Guoqing Xiao;Li Xia;Yuedan Chen;Hongyang Chen;Wangdong Yang","doi":"10.1109/TC.2025.3558042","DOIUrl":"https://doi.org/10.1109/TC.2025.3558042","url":null,"abstract":"Graph neural networks (GNNs) are a prominent trend in graph-based deep learning, known for their capacity to produce high-quality node embeddings. However, existing GNN frameworks are designed only at the algorithm level and do not fully utilize the GPU's hardware architecture. To this end, we propose DCGG, a dynamically adaptive runtime framework that can accelerate various GNN workloads on GPU platforms. DCGG applies deeper optimizations, mainly in load balancing and hardware-software matching. Accordingly, three optimization strategies are proposed. First, we propose dynamic 2D workload management methods and perform customized optimization based on them, effectively reducing additional memory operations. Second, a new slicing strategy is adopted, combined with hardware features, to effectively improve the efficiency of data reuse. Third, DCGG uses the Quantitative Dimension Parallel Strategy to optimize dimensions and parallel methods, greatly improving load balance and data locality.
Extensive experiments demonstrate that DCGG outperforms the state-of-the-art GNN computing frameworks, such as Deep Graph Library (up to 3.10<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula> faster) and GNNAdvisor (up to 2.80<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula> faster), on mainstream GNN architectures across various datasets.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2293-2305"},"PeriodicalIF":3.6,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
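The "2D workload management" idea for GNN load balancing can be illustrated on a CSR graph: cutting each node's neighbor list into fixed-size groups so no GPU thread group inherits one giant row. The function below is a generic sketch of that common neighbor-grouping technique, not DCGG's actual scheduler; the names and group encoding are assumptions.

```python
def split_workload(row_ptr, group_size):
    """Cut each node's neighbor range (CSR row_ptr) into fixed-size
    groups, yielding (node, start, end) work units of bounded size
    so load is balanced across GPU warps/thread groups."""
    groups = []
    for node in range(len(row_ptr) - 1):
        start, end = row_ptr[node], row_ptr[node + 1]
        for g in range(start, end, group_size):
            groups.append((node, g, min(g + group_size, end)))
    return groups
```

A high-degree node thus becomes several bounded work units instead of one straggler, at the cost of a later per-node reduction across its groups.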
{"title":"SmartZone: Runtime Support for Secure and Efficient On-Device Inference on ARM TrustZone","authors":"Zhaolong Jian;Xu Liu;Qiankun Dong;Longkai Cheng;Xueshuo Xie;Tao Li","doi":"10.1109/TC.2025.3557971","DOIUrl":"https://doi.org/10.1109/TC.2025.3557971","url":null,"abstract":"On-device inference is a burgeoning paradigm that performs model inference locally on end devices, allowing private data to remain local. ARM TrustZone, a widely supported trusted execution environment, has been applied to provide confidentiality protection for on-device inference. However, with the rise of large-scale models like large language models (LLMs), TrustZone-based on-device inference faces challenges in migration difficulties and inefficient execution. The rudimentary TEE OS on TrustZone lacks both the inference runtime needed for building models and the parallel support necessary to accelerate inference. Moreover, the limited secure memory resources on end devices further constrain the model size and degrade performance. In this paper, we propose SmartZone to provide runtime support for secure and efficient on-device inference on TrustZone. SmartZone consists of three main components: (1) a trusted inference-oriented operator set, providing the underlying mechanisms adapted to the TrustZone's execution mode for trusted inference of DNN models and LLMs, (2) proactive multi-threading parallel support, which increases the number of CPU cores in the secure state via cross-world thread collaboration to achieve parallelism, and (3) an on-demand secure memory management method, which statically allocates the appropriate secure memory size based on pre-execution resource analysis. We implement a prototype of SmartZone on the Raspberry Pi 3B+ board and evaluate it on four well-known DNN models and the llama2 LLM. Extensive experimental results show that SmartZone provides end-to-end protection for on-device inference while maintaining excellent performance.
Compared to the original trusted inference, SmartZone accelerates the inference speed by up to <inline-formula><tex-math>$4.26\boldsymbol{\times}$</tex-math></inline-formula> and reduces energy consumption by <inline-formula><tex-math>$65.81\%$</tex-math></inline-formula>.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 6","pages":"2144-2158"},"PeriodicalIF":3.6,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143929675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
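Component (3) above, static secure-memory sizing from pre-execution analysis, has a simple core idea that can be sketched: with layer-by-layer execution, the live set at any step is one layer's weights plus its input and output activations, so the required secure-memory budget is a maximum over layers rather than a sum. The function and its tuple layout are illustrative assumptions, not SmartZone's actual analyzer.

```python
def peak_secure_memory(layer_bytes):
    """Pre-execution sizing sketch: layer_bytes is a list of
    (weights, activation_in, activation_out) sizes in bytes.
    With layer-at-a-time execution, the static secure-memory
    allocation only needs to cover the largest single layer's
    live set, not the whole model."""
    peak = 0
    for weights, act_in, act_out in layer_bytes:
        peak = max(peak, weights + act_in + act_out)
    return peak
```

This is why an on-demand static allocation can stay far below total model size while still never running out of secure memory mid-inference.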
{"title":"Accelerating RNA-Seq Quantification on a Real Processing-in-Memory System","authors":"Liang-Chi Chen;Chien-Chung Ho;Yuan-Hao Chang","doi":"10.1109/TC.2025.3558075","DOIUrl":"https://doi.org/10.1109/TC.2025.3558075","url":null,"abstract":"Recently, with the growth of the required data size for emerging applications (e.g., graph processing and machine learning), the von Neumann bottleneck has become a major problem restricting the throughput of these applications. To address this problem, an acceleration technique called Processing in Memory (PIM) has garnered attention due to its potential to reduce off-chip data movement between the processing unit (e.g., CPU) and memory device (e.g., DRAM). In 2019, UPMEM introduced the commercially available processing-in-memory product, the DRAM Processing Unit (DPU) [8], offering a new opportunity to accelerate data-intensive applications. Among data-intensive applications, RNA sequence (RNA-seq) quantification is used to measure the abundance of RNA sequences, and it plays a critical role in the field of bioinformatics. We aim to leverage the UPMEM DPU to accelerate RNA-seq quantification. However, due to usage limitations imposed by the DPU hardware, there are several challenges to realizing RNA-seq quantification on the DPU system. To overcome these challenges, we propose UpPipe, which consists of the DPU-friendly transcriptome allocation, the DPU-aware pipeline management, and the WRAM prefetching scheme. UpPipe considers the hardware limitations of DPUs, enabling efficient sequence alignment even within the resource-constrained DPUs. The experimental results demonstrate the feasibility and efficiency of our proposed design.
We also provide an evaluation study on the impact of data granularity selection on pipeline management and the optimal size for the WRAM prefetching scheme.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2334-2347"},"PeriodicalIF":3.6,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
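The per-DPU kernel that such a system parallelizes can be illustrated with a minimal k-mer matching routine — the basic operation of alignment-free RNA-seq quantification. This is a generic sketch of the workload class, not UpPipe's implementation; in the real system transcripts are sharded across DPUs and reads are pipelined through them, while the snippet below only shows one matching step.

```python
def kmer_hits(read, transcript, k):
    """Count how many k-mers of `read` occur in `transcript`.
    Building the transcript's k-mer set once and probing it per
    read k-mer is the per-shard work a PIM unit would perform."""
    index = {transcript[i:i + k] for i in range(len(transcript) - k + 1)}
    return sum(read[i:i + k] in index for i in range(len(read) - k + 1))
```

Because each transcript shard is probed independently, the hit counts from many DPUs can be reduced afterwards on the host, which is what makes the workload a natural fit for processing-in-memory.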
{"title":"Accelerating Loss Recovery for Content Delivery Network","authors":"Tong Li;Wei Liu;Xinyu Ma;Shuaipeng Zhu;Jingkun Cao;Duling Xu;Zhaoqi Yang;Senzhen Liu;Taotao Zhang;Yinfeng Zhu;Bo Wu;Kezhi Wang;Ke Xu","doi":"10.1109/TC.2025.3558020","DOIUrl":"https://doi.org/10.1109/TC.2025.3558020","url":null,"abstract":"Packet losses significantly impact the user experience of content delivery network (CDN) services such as live streaming and data backup-and-archiving. However, our production network measurement studies show that the legacy loss recovery is far from satisfactory due to the wide-area loss characteristics (i.e., dynamics and burstiness) in the wild. In this paper, we propose a sender-side Adaptive ReTransmission scheme, ART, which minimizes the recovery time of lost packets with minimal redundancy cost. Distinguishing itself from forward-error-correction (FEC), which preemptively sends redundant data packets to prevent loss, ART functions as an automatic-repeat-request (ARQ) scheme. It applies redundancy specifically to lost packets instead of unlost packets, thereby addressing the characteristic patterns of wide-area losses in real-world scenarios. We implement ART upon QUIC protocol and evaluate it via both trace-driven emulation and real-world deployment. 
The results show that ART reduces flow completion time (FCT) by up to 34% for delay-sensitive transmissions, improves goodput by up to 26% for throughput-intensive transmissions, reduces video playback rebuffering by 11.6%, and saves up to 90% of redundancy cost.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2223-2237"},"PeriodicalIF":3.6,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
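ART's core contrast with FEC — spending redundancy only on packets already known to be lost — can be given a back-of-the-envelope form: choose the number of duplicate copies of a retransmission so that the chance all copies are lost again stays below a target. The formula below assumes independent losses and is my illustrative model, not the paper's estimator, which adapts to dynamic and bursty wide-area loss.

```python
import math

def retransmit_copies(loss_rate, target_fail=0.01):
    """Pick n so that loss_rate ** n <= target_fail, i.e. the
    probability that every copy of the retransmitted packet is
    lost again stays under the target (independence assumed)."""
    if loss_rate <= 0.0:
        return 1
    loss_rate = min(loss_rate, 0.99)   # guard the log below
    return max(1, math.ceil(math.log(target_fail) / math.log(loss_rate)))
```

Under this toy model a 25% loss rate needs 4 copies of each retransmission for a 1% residual-failure target, while unlost packets carry no redundancy at all — the opposite of FEC's preemptive overhead.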
{"title":"CIMUS: 3D-Stacked Computing-in-Memory Under Image Sensor Architecture for Efficient Machine Vision","authors":"Lixia Han;Yiyang Chen;Siyuan Chen;Haozhang Yang;Ao Shi;Guihai Yu;Jiaqi Li;Zheng Zhou;Yijiao Wang;Yanzhi Wang;Xiaoyan Liu;Jinfeng Kang;Peng Huang","doi":"10.1109/TC.2025.3558068","DOIUrl":"https://doi.org/10.1109/TC.2025.3558068","url":null,"abstract":"Computational image sensors with CNN processing capabilities are emerging to alleviate the energy-intensive and time-consuming data movement between sensors and external processors. However, deploying CNN models onto these computational image sensors faces challenges from the limited on-chip memory resources and insufficient image processing throughput. This work proposes a 3D-stacked NAND flash-based computing-in-memory under image sensor architecture (CIMUS) to facilitate the complete deployment of CNN models. To fully leverage the potential of high bandwidth from the 3D-stacked integration, we design a novel distributed CNN mapping and dataflow to process the full focal plane image in parallel, which senses and recognizes ImageNet tasks at over 1000 fps. To tackle the computational error of inputs “0” in 3D NAND flash-based CIM, we propose an input-independent offset compensation method, which reduces the average vector-matrix multiplication (VMM) error by 48%.
Evaluation results indicate that CIMUS architecture achieves a 9.8× improvement in CNN inference speed and a 33× boost in energy efficiency compared to the state-of-the-art computational image sensor in the ImageNet recognition task.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2321-2333"},"PeriodicalIF":3.6,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
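The idea of input-independent offset compensation can be shown with a toy error model: if every cell contributes a fixed parasitic offset to the analog column sum regardless of its input, the total offset depends only on the number of cells, so a reference measurement can subtract it out without knowing the inputs. This model (uniform per-cell offset, a single reference column) is my illustrative assumption, not CIMUS's circuit-level method.

```python
def vmm_with_offset_compensation(inputs, weights, offset_per_cell):
    """Toy analog VMM: the raw column current is the true dot product
    plus one fixed offset per cell. Subtracting a reference column
    that accumulates only the offsets recovers the dot product,
    independently of the input pattern (including inputs of 0)."""
    raw = sum(x * w + offset_per_cell for x, w in zip(inputs, weights))
    reference = offset_per_cell * len(inputs)   # reference-column output
    return raw - reference
```

Because the correction term never looks at the inputs, it can be computed once per array and applied to every VMM, which is what "input-independent" buys over per-input calibration.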