IEEE Transactions on Parallel and Distributed Systems最新文献_第2页

Chameleon: An Efficient FHE Scheme Switching Acceleration on GPUs 变色龙：一种高效的gpu切换加速方案

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-09-02 DOI: 10.1109/TPDS.2025.3604866

Zhiwei Wang;Haoqi He;Lutan Zhao;Peinan Li;Zhihao Li;Dan Meng;Rui Hou

{"title":"Chameleon: An Efficient FHE Scheme Switching Acceleration on GPUs","authors":"Zhiwei Wang;Haoqi He;Lutan Zhao;Peinan Li;Zhihao Li;Dan Meng;Rui Hou","doi":"10.1109/TPDS.2025.3604866","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3604866","url":null,"abstract":"Fully homomorphic encryption (FHE) enables direct computation on encrypted data, making it a crucial technology for privacy protection. However, FHE suffers from significant performance bottlenecks. In this context, GPU acceleration offers a promising solution to bridge the performance gap. Existing efforts primarily focus on single-class FHE schemes, which fail to meet the diverse requirements of data types and functions, prompting the development of hybrid multi-class FHE schemes. However, studies have yet to thoroughly investigate specific GPU optimizations for hybrid FHE schemes. In this article, we present an efficient GPU-based FHE scheme switching acceleration named Chameleon. First, we propose a scalable NTT acceleration design that adapts to larger CKKS polynomials and smaller TFHE polynomials. Specifically, Chameleon tackles synchronization issues by fusing stages to reduce synchronization, employing polynomial coefficient shuffling to minimize synchronization scale, and utilizing an SM-aware combination strategy to identify the optimal switching point. Second, Chameleon is the first to comprehensively analyze and optimize critical switching operations. It introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency. Finally, Chameleon outperforms the state-of-the-art GPU implementations by 1.23× in CKKS HMUL and 1.15× in bootstrapping. It also achieves up to 4.87× and 1.51× speedups for TFHE bootstrapping compared to CPU and GPU versions, respectively, and delivers a 67.3× average speedup for scheme switching over CPU-based implementation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2264-2280"},"PeriodicalIF":6.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11146703","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

PreTrans: Enabling Efficient CGRA Multi-Task Context Switch Through Config Pre-Mapping and Data Transceiving PreTrans：通过配置预映射和数据收发实现高效的CGRA多任务上下文切换

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-09-02 DOI: 10.1109/TPDS.2025.3604815

Yufei Yang;Chenhao Xie;Liansheng Liu;Xiyuan Peng;Yu Peng;Hailong Yang;Depei Qian

{"title":"PreTrans: Enabling Efficient CGRA Multi-Task Context Switch Through Config Pre-Mapping and Data Transceiving","authors":"Yufei Yang;Chenhao Xie;Liansheng Liu;Xiyuan Peng;Yu Peng;Hailong Yang;Depei Qian","doi":"10.1109/TPDS.2025.3604815","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3604815","url":null,"abstract":"Dynamic resource allocation guarantees the performance of CGRA multi-task, but incurs a wide range of incompatible contexts (config & data) to the CGRA architecture. However, traditional context switch approaches including online config transformation and data reloading may significantly block the task to process inputs under new resource allocation decisions, resulting in the limited task throughput. To address this issue, online config transformation can be avoided if compatible configs have been prepared through offline pre-mapping, but traditional CGRA mappers require days to achieve comprehensive pre-mapping with considerable quality. Besides, online data reloading can also be eliminated through memory sharing, but the traditional arbiter-based approach has the difficulty of trading off physical complexity and memory access parallelism. PreTrans is the first system design to achieve the efficient CGRA multi-task context switch. PreTrans first avoids the online config transformation through a software incremental pre-mapper, which re-utilizes the previously finished pre-mapping results to dramatically accelerate the pre-mapping of subsequent resource allocation decisions with negligible mapping quality loss. Second, PreTrans replaces the traditional arbiter with a hardware data transceiver to better support the memory sharing that eliminates data reloading, which allows each tile to possess an individual memory that maximizes the access parallelism without introducing significant physical overhead. The overall evaluation demonstrates that PreTrans achieves 1.13 <inline-formula><tex-math>$sim 2.46times$</tex-math></inline-formula> throughput improvement on pipeline and parallel multi-task scenarios, and can reach the target throughput immediately after the new resource allocation decision takes effect. Ablation study further shows that the pre-mapper is more than 3 magnitudes faster than the traditional CGRA mapper while maintaining more than 99% of the optimal mapping quality, and the data transceiver only introduces 9.02% hardware area overhead under 16 × 16 CGRA.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2214-2228"},"PeriodicalIF":6.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Sparse Function Prediction Approach for Cold Start Optimization and User Satisfaction Guarantee in Serverless 无服务器冷启动优化与用户满意度保证的稀疏函数预测方法

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-08-25 DOI: 10.1109/TPDS.2025.3602440

Wang Zhang;Yuyang Zhu;Zhan Shi;Manyu Dang;Yutong Wu;Fang Wang;Dan Feng

{"title":"A Sparse Function Prediction Approach for Cold Start Optimization and User Satisfaction Guarantee in Serverless","authors":"Wang Zhang;Yuyang Zhu;Zhan Shi;Manyu Dang;Yutong Wu;Fang Wang;Dan Feng","doi":"10.1109/TPDS.2025.3602440","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3602440","url":null,"abstract":"Serverless computing relies on keeping functions alive or pre-warming them before invocation to mitigate the cold start problem, stemming from the overhead of initializing function startup environments. However, under constrained cloud resources, accurately predicting the invocation patterns of sparse functions remains challenging. This limits the formulation of effective pre-warm and keep-alive strategies, leading to frequent cold starts and degraded user satisfaction. To address these challenges, we propose <italic>SPFaaS, a hybrid framework based on sparse function prediction. To enhance the learnability of sparse function invocation data, <italic>SPFaaS takes into account the characteristics of cloud service workloads along with the features of pre-warm and keep-alive strategies, transforming function invocation records into probabilistic data. It captures the underlying periodicity and temporal dependencies in the data through multiple rounds of sampling and the combined use of Gated Recurrent Units and Temporal Convolutional Networks for accurate prediction. Based on the final prediction outcome and real-time system states, <italic>SPFaaS determines adaptive pre-warm and keep-alive strategies for each function. Experiments conducted on two real-world serverless clusters demonstrate that <italic>SPFaaS outperforms state-of-the-art methods in reducing cold starts and improving user satisfaction.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2198-2213"},"PeriodicalIF":6.0,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Mapping Large-Scale Spiking Neural Network on Arbitrary Meshed Neuromorphic Hardware 在任意网格神经形态硬件上映射大规模脉冲神经网络

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-08-25 DOI: 10.1109/TPDS.2025.3601993

Ouwen Jin;Qinghui Xing;Zhuo Chen;Ming Zhang;De Ma;Ying Li;Xin Du;Shuibing He;Shuiguang Deng;Gang Pan

{"title":"Mapping Large-Scale Spiking Neural Network on Arbitrary Meshed Neuromorphic Hardware","authors":"Ouwen Jin;Qinghui Xing;Zhuo Chen;Ming Zhang;De Ma;Ying Li;Xin Du;Shuibing He;Shuiguang Deng;Gang Pan","doi":"10.1109/TPDS.2025.3601993","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3601993","url":null,"abstract":"Neuromorphic hardware systems—designed as 2D-mesh structures with parallel neurosynaptic cores—have proven highly efficient at executing large-scale spiking neural networks (SNNs). A critical challenge, however, lies in mapping neurons efficiently to these cores. While existing approaches work well with regular, fully functional mesh structures, they falter in real-world scenarios where hardware has irregular shapes or non-functional cores caused by defects or resource fragmentation. To address these limitations, we propose a novel mapping method based on an innovative space-filling curve: the Adaptive Locality-Preserving (ALP) curve. Using a unique divide-and-conquer construction algorithm, the ALP curve ensures adaptability to meshes of any shape while maintaining crucial locality properties—essential for efficient mapping. Our method demonstrates exceptional computational efficiency, making it ideal for large-scale deployments. These distinctive characteristics enable our approach to handle complex scenarios that challenge conventional methods. Experimental results show that our method matches state-of-the-art solutions in regular-shape mapping while achieving significant improvements in irregular scenarios, reducing communication overhead by up to 57.1%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2325-2340"},"PeriodicalIF":6.0,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

EdgeAIBus: AI-Driven Joint Container Management and Model Selection Framework for Heterogeneous Edge Computing EdgeAIBus：人工智能驱动的异构边缘计算联合容器管理和模型选择框架

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-08-25 DOI: 10.1109/TPDS.2025.3602521

Babar Ali;Muhammed Golec;Sukhpal Singh Gill;Felix Cuadrado;Steve Uhlig

{"title":"EdgeAIBus: AI-Driven Joint Container Management and Model Selection Framework for Heterogeneous Edge Computing","authors":"Babar Ali;Muhammed Golec;Sukhpal Singh Gill;Felix Cuadrado;Steve Uhlig","doi":"10.1109/TPDS.2025.3602521","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3602521","url":null,"abstract":"Containerized Edge computing offers lightweight, reliable, and quick solutions to latency-critical Machine Learning (ML) and Deep Learning (DL) applications. Existing solutions considering multiple Quality of Service (QoS) parameters either overlook the intricate relation of QoS parameters or pose significant scheduling overheads. Furthermore, reactive decision-making can damage Edge servers at peak load, incurring escalated costs and wasted computations. Resource provisioning, scheduling, and ML model selection substantially influence energy consumption, user-perceived accuracy, and delay-oriented Service Level Agreement (SLA) violations. Addressing contrasting objectives and QoS simultaneously while avoiding server faults is highly challenging in the exposed heterogeneous and resource-constrained Edge continuum. In this work, we propose the EdgeAIBus framework that offers a novel joint container management and ML model selection algorithm based on Importance Weighted Actor-Learner Architecture to optimize energy, accuracy, SLA violations, and avoid server faults. First, Patch Time Series Transformer (PatchTST) is utilized for CPU usage predictions of Edge servers for its 8.51% Root Mean Squared Error and 5.62% Mean Absolute Error. Leveraging pipelined predictions, EdgeAIBus conducts consolidation, resource oversubscription, and ML/DL model switching with possible migrations to conserve energy, maximize utilization and user-perceived accuracy, and reduce SLA violations. Simulation results show EdgeAIBus oversubscribed 110% cluster-wide CPU with real usage up to 70%, conserved 14 CPU cores, incurred less than 1% SLA violations with 2.54% drop in inference accuracy against industry-led Model Switching Balanced load and Google Kubernetes Optimized schedulers. Google Kubernetes Engine experiments demonstrate 80% oversubscription, 14 CPU cores conservation, 1% SLA violations, and 3.81% accuracy loss against the counterparts. Finally, constrained setting experiment analysis shows that PatchTST and EdgeAIBus can produce decisions within 100 ms in a 1-core and 1 GB memory device.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2412-2424"},"PeriodicalIF":6.0,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11139102","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

MemTunnel: A CXL-Based Rack-Scale Host Memory Pooling Architecture for Cloud Service MemTunnel：一种基于cxl的机架级云服务主机内存池架构

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-08-20 DOI: 10.1109/TPDS.2025.3598190

Tianchan Guan;Yijin Guan;Zhaoyang Du;Jiacheng Ma;Boyu Tian;Zhao Wang;Teng Ma;Zheng Liu;Yang Kong;Yuan Xie;Mingyu Gao;Guangyu Sun;Hongzhong Zheng;Dimin Niu

{"title":"MemTunnel: A CXL-Based Rack-Scale Host Memory Pooling Architecture for Cloud Service","authors":"Tianchan Guan;Yijin Guan;Zhaoyang Du;Jiacheng Ma;Boyu Tian;Zhao Wang;Teng Ma;Zheng Liu;Yang Kong;Yuan Xie;Mingyu Gao;Guangyu Sun;Hongzhong Zheng;Dimin Niu","doi":"10.1109/TPDS.2025.3598190","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3598190","url":null,"abstract":"Memory underutilization poses a significant challenge in cloud services, leading to performance inefficiencies and resource wastage. The tightly coupled computing and memory resources in cloud servers are identified as the root cause of this problem. To address this issue, memory pooling has been the subject of extensive research for decades, providing centralized or distributed shared memory pools as flexible memory resources for various applications running on different servers. However, existing memory disaggregation solutions sacrifice memory resources, add extra hardware (such as memory boxes/blades/drives), and degrade memory performance to achieve flexibility. To overcome these limitations, this paper proposes MemTunnel, a rack-scale host memory pooling architecture that provides a low-cost memory pooling solution based on Compute Express Link (CXL). MemTunnel is the first hardware and software architecture to offer symmetric, memory-semantic memory pooling over CXL, with an FPGA-based platform to demonstrate its feasibility in a real implementation. MemTunnel is orthogonal to the existing CXL-based memory pool and provides an additional layer of abstraction for memory disaggregation. Evaluation results show that MemTunnel achieves comparable performance to the existing CXL-based memory pool for a single machine and provides better rack-scale performance with minor hardware overheads.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2182-2197"},"PeriodicalIF":6.0,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Exploiting Fine-Grained Task-Level Parallelism for Variant Calling Acceleration 利用细粒度任务级并行性实现变量调用加速

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-08-19 DOI: 10.1109/TPDS.2025.3600285

Menghao Guo;Longlong Chen;Yichi Zhang;Hongyi Guan;Shaojun Wei;Jianfeng Zhu;Leibo Liu

{"title":"Exploiting Fine-Grained Task-Level Parallelism for Variant Calling Acceleration","authors":"Menghao Guo;Longlong Chen;Yichi Zhang;Hongyi Guan;Shaojun Wei;Jianfeng Zhu;Leibo Liu","doi":"10.1109/TPDS.2025.3600285","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3600285","url":null,"abstract":"Variant calling, which identifies genomic differences relative to a reference genome, is critical for understanding disease mechanisms, identifying therapeutic targets, and advancing precision medicine. However, as two critical stages in this process, serial processing in local assembly and the computational dependencies in Pair-HMM make variant calling highly time-consuming. Moreover, optimizing only one of these stages often shifts the performance bottleneck to the other. This article observes that the similarity between reads allows parallel processing in the local assembly and that alignment information from the local assembly can significantly diminish the burdensome computations in Pair-HMM. Accordingly, this article co-optimizes the software and hardware for both steps to achieve the best performance. First, we collect <inline-formula><tex-math>$k$</tex-math></inline-formula>-mer locations in each read during the local assembly process and utilize the similarity between reads to make it parallel. Second, we propose the mPair-HMM algorithm, leveraging location information to split a Pair-HMM computation task into multiple independent sub-tasks, improving the computation’s parallelism. To fully exploit the parallelism stemming from the novel algorithms, we propose an end-to-end accelerator VCAx for variant calling that accelerates both stages in collaboration. Evaluation results demonstrate that our implementation achieves up to a 7× speedup over the GPU baseline for local assembly and a 3.16× performance improvement compared to the state-of-the-art ASIC implementation for Pair-HMM.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2169-2181"},"PeriodicalIF":6.0,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Rethinking Virtual Machines Live Migration for Memory Disaggregation 重新思考虚拟机动态迁移的内存分解

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-08-18 DOI: 10.1109/TPDS.2025.3597149

Xingzi Yu;Xingguo Jia;Jin Zhang;Yun Wang;Senhao Yu;Zhengwei Qi

{"title":"Rethinking Virtual Machines Live Migration for Memory Disaggregation","authors":"Xingzi Yu;Xingguo Jia;Jin Zhang;Yun Wang;Senhao Yu;Zhengwei Qi","doi":"10.1109/TPDS.2025.3597149","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3597149","url":null,"abstract":"Resource underutilization has troubled data centers for several decades. On the CPU front, live migration plays a crucial role in reallocating CPU resources. Nevertheless, contemporary Virtual Machine (VM) live migration methods are burdened by substantial resource consumption. In terms of memory management, disaggregated memory offers an effective solution to enhance memory utilization, but leaves a gap in addressing CPU underutilization. Our findings highlight a considerable opportunity to optimize live migration in the context of disaggregated memory systems. We introduce Anemoi, a resource management system that seamlessly integrates VM live migration with memory disaggregation to address the aforementioned gap. In the context of disaggregated memory, remote memory becomes accessible from destination nodes, effectively eliminating the need for extensive network transmission of memory pages, and thereby significantly reducing migration time. In addition, we propose using memory replicas as an optimization to the live migration system. To mitigate the overhead of potential excessive memory consumption, we develop a dedicated compression algorithm. Our evaluations demonstrate that Anemoi leads to a notable 69% reduction in network bandwidth utilization and an impressive 83% reduction in migration time compared to traditional VM live migration. Additionally, our compression algorithm achieves an outstanding space-saving rate of 83.6%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2310-2324"},"PeriodicalIF":6.0,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

RL-Based Hybrid CPU Scaling for Soft Deadline Constrained Tasks in Container Clouds 基于rl的容器云软期限约束任务混合CPU扩展

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-08-08 DOI: 10.1109/TPDS.2025.3597195

Yepeng Zhang;Haitao Zhang;Huadong Ma

{"title":"RL-Based Hybrid CPU Scaling for Soft Deadline Constrained Tasks in Container Clouds","authors":"Yepeng Zhang;Haitao Zhang;Huadong Ma","doi":"10.1109/TPDS.2025.3597195","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3597195","url":null,"abstract":"Existing CPU scaling approaches have limitations that can lead to inefficient resource allocation and increased penalty costs for tasks with soft deadlines running in container clouds. First, quota allocation based approaches overlook the gap between the obtainable CPU time and allocated quota, causing inefficient CPU utilization and unexpected task behaviors. Second, core allocation based approaches ignore workload dynamics within decision intervals, potentially increasing contention for CPU time among tasks on the same core. Third, existing approaches lack strategies to allocate more resources to critical tasks that incur higher penalty costs when the node’s capacity is insufficient. This article proposes a reinforcement learning based hybrid CPU scaling approach that allocates quota and cores jointly, aiming to minimize penalty costs for timeouts. Based on the embedding generated from a fine-grained CPU demand series, we allocate CPU quotas and determine a dynamic workload-aware core sharing scheme using an attention mechanism that combines respective demands and global criticality regarding penalty costs. Additionally, we integrate the resource gap, CPU time contention, and penalty costs into the reward function to update our model online. The experimental results show the proposed approach achieves state-of-the-art performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2104-2118"},"PeriodicalIF":6.0,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Mariana: Exploring Native SkipList Index Design for Disaggregated Memory 探索分解内存的本地SkipList索引设计

IF 6 2区计算机科学

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-08-07 DOI: 10.1109/TPDS.2025.3596988

Xing Wei;Ke Wang;Yinjun Han;Hao Jin;Yaofeng Tu;Huiqi Hu;Xuan Zhou;Minghao Zhao

{"title":"Mariana: Exploring Native SkipList Index Design for Disaggregated Memory","authors":"Xing Wei;Ke Wang;Yinjun Han;Hao Jin;Yaofeng Tu;Huiqi Hu;Xuan Zhou;Minghao Zhao","doi":"10.1109/TPDS.2025.3596988","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3596988","url":null,"abstract":"Memory disaggregation has emerged as a promising architecture for improving resource efficiency by decoupling the computing and memory resources. But building efficient range indices in such an architecture faces three critical challenges: (1) coarse-grained concurrency control schemes for coordinating concurrent read/write operations with node splitting incur high contention under the skewed and write-intensive workloads; (2) existing data layouts fail to balance consistency verification and hardware acceleration via SIMD (Single Instruction Multiple Data); and (3) naive caching schemes struggle to adapt to rapidly changing access patterns. To address these challenges, we propose Mariana, a memory-disaggregated skiplist index that integrates three key innovations. First, it uses a fine-grained (i.e., entry-level) latch mechanism combined with dynamic node resizing to minimize the contention and splitting frequency. Second, it employs a tailored data layout for leaf node, which separates keys and values to enable SIMD acceleration while maintaining consistency checks with minimal write overhead. Third, it implements an adaptive caching strategy that tracks node popularity in real-time to optimize network bandwidth utilization during the index traversal. Experimental results show that Mariana achieves <inline-formula><tex-math>$1.7times$</tex-math></inline-formula> higher throughput under write-intensive workloads and reduces the P90 latency by 23% under the read-intensive workloads, when comparing to the state-of-the-art indices on disaggregated memory.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2137-2151"},"PeriodicalIF":6.0,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144990153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0