Title: GPComp: Using GPU and SSD-GPU Peer to Peer DMA to Accelerate LSM-Tree Compaction for Key-Value Store
Authors: Hao Zhou; Yuanhui Chen; Wu Zeng; Lixiao Cui; Gang Wang; Xiaoguang Liu
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 9, pp. 1920–1936. Published 2025-07-07. DOI: 10.1109/TPDS.2025.3586616
Abstract: LSM-tree-based key-value systems are widely used in many internet applications, known for their superior write performance. Compaction operations, which maintain the pyramidal storage structure of the LSM-tree to ensure acceptable read performance, pose a significant performance bottleneck. Applying high-performance SSDs and lightweight user-space file systems to LSM storage alleviates the IO bandwidth bottleneck, but it amplifies the computational cost of compaction when key-value pairs are small or medium-sized, shifting the bottleneck from IO to computation. To mitigate this computational bottleneck, we propose GPComp, a GPU-accelerated compaction strategy for high-performance SSDs with lightweight user-space file systems. GPComp features efficient GPU compaction units and a CPU-GPU cooperative compaction acceleration strategy. We introduce TopFS-GPU, a user-space file system designed specifically for LSM storage. It implements an SPDK-based SSD-GPU peer-to-peer IO stack to raise data transfer throughput during GPU-accelerated compaction, and it provides an asynchronous write-back cache strategy that supports mixed read-write workloads in LSM-tree-based key-value systems. Additionally, our pipeline mechanism overlaps GPU computation with SSD-GPU IO, increasing system throughput. Implemented on LevelDB, GPComp delivers up to a 2.65× increase in average write throughput and a 2.32× improvement in mixed read-write throughput, with a P99 tail latency reduction of up to 169.65% compared to state-of-the-art methods.
Title: Improved Methods of Task Assignment and Resource Allocation With Preemption in Edge Computing Systems
Authors: Adrian C. Rublein; Fidan Mehmeti; Mark Mahon; Thomas F. La Porta
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 9, pp. 1857–1871. Published 2025-06-30. DOI: 10.1109/TPDS.2025.3583966
Abstract: Edge computing has become a very popular service that enables mobile devices to run complex tasks with the help of network-based computing resources. However, edge clouds are often resource-constrained, which makes resource allocation a challenging issue. In addition, edge cloud servers must make allocation decisions with only limited information, since the arrival of future client tasks may be impossible to predict, and the states and behavior of neighboring servers may be obscured. We focus on a distributed resource allocation method in which servers operate independently, without communicating with each other, and interact with clients (tasks) to make allocation decisions. We follow a two-round bidding approach to assign tasks to edge cloud servers, and servers are allowed to preempt previously admitted tasks to accommodate more useful ones. We evaluate the performance of our system using realistic simulations and real-world trace data from a high-performance computing cluster. Results show that our heuristic improves system-wide performance by 20–25% over previous work when accounting for the time taken by each approach, achieving a favorable trade-off between allocation quality and decision speed.
Title: PipeMesh: Achieving Memory-Efficient Computation-Communication Overlap for Training Large Language Models
Authors: Fanxin Li; Shixiong Zhao; Yuhao Qing; Jianyu Jiang; Xusheng Chen; Heming Cui
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 9, pp. 1872–1889. Published 2025-06-27. DOI: 10.1109/TPDS.2025.3583983. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11054307
Abstract: Efficiently training large language models (LLMs) on commodity cloud resources remains challenging due to limitations in network bandwidth and accelerator memory capacity. Existing training systems can be categorized by their pipeline schedules. Depth-first scheduling, employed by systems like Megatron, prioritizes memory efficiency but restricts the overlap between communication and computation, leaving accelerators idle for over 20% of the training time. Conversely, breadth-first scheduling maximizes communication overlap but generates excessive intermediate activations, exceeding memory capacity and slowing computation by more than 34%. To address these limitations, we propose a novel elastic pipeline schedule that enables fine-grained control over the trade-off between communication overlap and memory consumption. Our approach determines the number of micro-batches scheduled together according to the communication time and the memory available. Furthermore, we introduce a mixed sharding strategy and a pipeline-aware selective recomputation technique to reduce memory usage. Experimental results demonstrate that our system eliminates most of the 28% all-accelerator idle time caused by communication, with recomputation accounting for less than 1.9% of the training time. Compared to existing baselines, PipeMesh improves training throughput on commodity clouds by 20.1% to 33.8%.
Title: Oases: Efficient Large-Scale Model Training on Commodity Servers via Overlapped and Automated Tensor Model Parallelism
Authors: Shengwei Li; Zhiquan Lai; Dongsheng Li; Yanqi Hao; Weijie Liu; Keshi Ge; Xiaoge Deng; Kai Lu
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 9, pp. 1828–1840. Published 2025-06-25. DOI: 10.1109/TPDS.2025.3583165
Abstract: Deep learning is experiencing a rise in large-scale models. Training such models is costly, prompting researchers to train them on the commodity servers that more researchers can access. The massive number of parameters necessitates model-parallel training methods. Existing studies focus on training with pipeline model parallelism. However, tensor model parallelism (TMP) becomes inevitable as model size keeps increasing, and its frequent data-dependent communication and computation operations significantly reduce training efficiency. In this article, we present Oases, an automated TMP method with overlapped communication that accelerates large-scale model training on commodity servers. Oases proposes a fine-grained training operation schedule that maximizes the overlap of communication and computation operations that have data dependences. Additionally, we design the Oases planner, which searches for the best TMP model-parameter partition strategy to achieve further acceleration. Unlike existing methods, the Oases planner is tailored to model the cost of overlapped communication-computation operations. We evaluate Oases on various model settings and two commodity clusters, comparing it against four state-of-the-art implementations. Experimental results show that Oases achieves speedups of 1.01–1.48× over the fastest baseline, and up to 1.95× over Megatron.
Title: GPUSCAN++: Efficient Structural Graph Clustering on GPUs
Authors: Long Yuan; Zeyu Zhou; Zi Chen; Xuemin Lin; Xiang Zhao; Fan Zhang
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 9, pp. 1890–1903. Published 2025-06-24. DOI: 10.1109/TPDS.2025.3582996
Abstract: Structural clustering is one of the most popular graph clustering methods and has achieved great performance improvement by utilizing GPUs. Even so, the state-of-the-art GPU-based structural clustering algorithm, GPUSCAN, still suffers from efficiency issues, since parallelization introduces substantial extra cost. Moreover, GPUSCAN assumes that the graph is resident in GPU memory. However, GPU memory capacity is currently limited, while many real-world graphs are large and cannot fit in GPU memory, which leaves GPUSCAN unable to handle them. Motivated by this, we present a new GPU-based structural clustering algorithm, GPUSCAN++, in this paper. To address the efficiency issue, we propose a new progressive clustering method tailored for GPUs that not only avoids high parallelization costs but also fully exploits the computing resources of GPUs. To address the GPU memory limitation, we propose a partition-based structural clustering algorithm that can process large graphs with limited GPU memory. We conduct experiments on real graphs, and the results demonstrate that our algorithm achieves up to 168× speedup over the state-of-the-art GPU-based algorithm when the graph fits in GPU memory. Moreover, our algorithm scales to large graphs: for example, it can finish structural clustering on a graph with 1.8 billion edges using less than 2 GB of GPU memory.
{"title":"Parallel Acceleration of Genome Variation Detection on Multi-Zone Heterogeneous System","authors":"Yaning Yang;Xiaoqi Wang;Chengqing Li;Shaoliang Peng","doi":"10.1109/TPDS.2025.3581972","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3581972","url":null,"abstract":"Genomic variation is critical for understanding the genetic basis of disease. Pindel, a widely used structural variant caller, leverages short-read sequencing data to detect variation at single-base resolution; however, its hotspot module imposes substantial computational demands, limiting efficiency in large-scale whole-genome analyses. Heterogeneous architectures offer a promising solution, yet disparities in hardware design and programming models preclude direct porting of the original algorithm. To address this, we introduce MTPindel, a novel heterogeneous parallel optimization framework tailored to the MT-3000 processor. Focusing on Pindel’s most compute-intensive modules, we design multi-core and task-level parallel algorithms that exploit the MT-3000’s accelerator domains to balance and accelerate workload distribution. On 128 MT-3000–equipped nodes of the Tianhe next-generation supercomputer, MTPindel achieves an impressive 122.549 times of speedup and 95.74% parallel efficiency, with only a 0.74% error margin relative to the original implementation. This work represents a pioneering effort in heterogeneous parallelization for variant detection, paving the way for rapid, large-scale genomic analyses in research and clinical settings.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1797-1809"},"PeriodicalIF":5.6,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Speculative Federated Tree Learning System With a Lightweight NN-Based Predictor","authors":"Yuhui Zhang;Hong Liao;Lutan Zhao;Yuncong Shao;Zhihong Tian;XiaoFeng Wang;Dan Meng;Rui Hou","doi":"10.1109/TPDS.2025.3581295","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3581295","url":null,"abstract":"Federated tree-based models are popular in many real-world applications owing to their high accuracy and good interpretability. However, the classical synchronous method causes inefficient federated tree-based model training due to tree node dependencies. Inspired by speculative execution techniques in modern high-performance processors, this paper proposes FTSeir, a novel and efficient speculative federated learning system. Instead of simply waiting, FTSeir optimistically predicts the outcome of the prior tree node. By resolving tree node dependencies with a neural network-based split point predictor, the training tasks of child tree nodes can be executed speculatively in advance via separate threads. This speculation enables cross-layer concurrent training, thus significantly reducing the waiting time. Furthermore, we propose an eager verification mechanism to promptly identify mispredictions, thereby reducing wasted computing resources. On a misprediction, an incomplete rollback is triggered for quick recovery by reusing the output of the mis-speculative training, which reduces computational requirements. We implement FTSeir and evaluate its efficiency in a real-world federated learning setting with six public datasets. Evaluation results demonstrate that FTSeir achieves up to 3.45× and 3.60× speedup over the state-of-the-art gradient boosted decision trees and random forests implementations, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1728-1743"},"PeriodicalIF":5.6,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Cross-Search With Improved Multi-Dimensional Dichotomy-Based Joint Optimization for Distributed Parallel Training of DNN
Authors: Guangyao Zhou; Yiqin Fu; Haocheng Lan; Yuanlun Xie; Wenhong Tian; Rajkumar Buyya; Jianhong Qian; Teng Su
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 8, pp. 1680–1694. Published 2025-06-16. DOI: 10.1109/TPDS.2025.3580098
Abstract: Distributed parallel training of large-scale deep neural networks (DNNs) has attracted attention from both the artificial intelligence and high-performance distributed computing communities. One efficient approach is micro-batch-based pipeline parallelism (MBPP), exemplified by GPipe and TeraPipe. Building on MBPP, we establish a time-cost model with a basic time function for layers that accounts for computation time and communication time simultaneously and treats both as nonlinear in the amount of input data. Focusing on jointly optimal solutions for network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that the improved multi-dimensional dichotomy (IMD) offers appreciable theoretical optimality with linear computational complexity, significantly faster than state-of-the-art methods including dynamic programming and recursive algorithms. Extensive experiments on both CNN-based and transformer-based neural networks demonstrate that CSIMD obtains optimal network division and data partition schemes under MBPP. On average, the training speeds of CSIMD on CNN-based and transformer-based DNNs are (2.0, 2.5)× and (2.66, 5.48)× those of (MBPP-R, MBPP-E), respectively.
{"title":"Safe Multi-Agent Deep Reinforcement Learning for the Management of Autonomous Connected Vehicles at Future Intersections","authors":"Rui Zhao;Kui Wang;Yun Li;Yuze Fan;Fei Gao;Zhenhai Gao","doi":"10.1109/TPDS.2025.3580092","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3580092","url":null,"abstract":"As Connected and Autonomous Vehicles (vehicle) evolve, Autonomous Intersection Management (AIM) systems are emerging to enable safe, efficient traffic flow at urban intersections without traffic signals. However, existing AIM systems, whether based on traditional optimization control methods or machine learning, suffer from low computational efficiency and a lack of robustness in ensuring safety, respectively. To overcome these limitations, we propose an innovative AIM scheme rooted in Safe Multi-Agent Deep Reinforcement Learning (MADRL). We initially model the safe MADRL problem as a constrained Markov game (CMG) and tackle it with our multi-agent projective constrained policy optimization (MAPCPO). This method first optimizes policy updates within the Kullback-Leibler divergence trust region to maximize performance, and then projects these optimized policies onto the bounds of risk constraints, thus ensuring safety. Building on this, we introduce a Risk-Bounded RL for Autonomous Intersection Management (RbRL-AIM) algorithm. This algorithm adopts an architecture that consists of an LSTM based policy neural network, a reward value network, and a risk neural network. These components, through the MAPCPO policy, enable continuous learning from complex and random intersection traffic environments, thereby facilitating the safe, efficient, and smooth control of vehicles at intersections. Our method is validated in a CARLA simulation, showing significant gains in computational and traffic efficiency over baseline optimization control methods. Compared to non-safety-aware MADRL methods, our approach achieves zero collisions and improved ride comfort.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1744-1761"},"PeriodicalIF":5.6,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Building Accurate and Interpretable Online Classifiers on Edge Devices
Authors: Yuanming Zhang; Pinghui Wang; Kuankuan Cheng; Junzhou Zhao; Jing Tao; Jingxin Hai; Junlan Feng; Chao Deng; Xidian Wang
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 8, pp. 1779–1796. Published 2025-06-13. DOI: 10.1109/TPDS.2025.3579121
Abstract: By integrating machine learning with edge devices, we can augment the capabilities of devices such as IoT devices, household appliances, and wearable technologies. These edge devices generally operate on microcontrollers with inherently limited resources, such as constrained RAM capacity and limited computational power. Nonetheless, they often process data as high-velocity streams, exemplified by the sequences of activities and statuses monitored by advanced industrial sensors. In practical scenarios, models must also be interpretable to facilitate troubleshooting and behavior understanding. Implementing machine learning models on edge devices is therefore valuable but challenging, requiring a balance between model efficacy and resource constraints. To address this challenge, we introduce Onfesk, which combines online learning algorithms with an innovative interpretable kernel. Specifically, Onfesk trains an online classifier over the kernel's feature sketches. Thanks to specially designed modules, the kernel's feature sketches can be produced efficiently, and the memory requirements of the classifier are significantly reduced. As a result, Onfesk delivers effective and efficient performance in resource-limited environments without compromising model interpretability. Extensive experiments on diverse real-world datasets show that Onfesk outperforms state-of-the-art methods, achieving up to a 7.4% improvement in accuracy under identical memory constraints.