{"title":"Parallel Acceleration of Genome Variation Detection on Multi-Zone Heterogeneous System","authors":"Yaning Yang;Xiaoqi Wang;Chengqing Li;Shaoliang Peng","doi":"10.1109/TPDS.2025.3581972","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3581972","url":null,"abstract":"Genomic variation is critical for understanding the genetic basis of disease. Pindel, a widely used structural variant caller, leverages short-read sequencing data to detect variation at single-base resolution; however, its hotspot module imposes substantial computational demands, limiting efficiency in large-scale whole-genome analyses. Heterogeneous architectures offer a promising solution, yet disparities in hardware design and programming models preclude direct porting of the original algorithm. To address this, we introduce MTPindel, a novel heterogeneous parallel optimization framework tailored to the MT-3000 processor. Focusing on Pindel’s most compute-intensive modules, we design multi-core and task-level parallel algorithms that exploit the MT-3000’s accelerator domains to balance and accelerate workload distribution. On 128 MT-3000–equipped nodes of the Tianhe next-generation supercomputer, MTPindel achieves a 122.549× speedup with 95.74% parallel efficiency, and only a 0.74% error margin relative to the original implementation. 
This work represents a pioneering effort in heterogeneous parallelization for variant detection, paving the way for rapid, large-scale genomic analyses in research and clinical settings.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1797-1809"},"PeriodicalIF":5.6,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Speculative Federated Tree Learning System With a Lightweight NN-Based Predictor","authors":"Yuhui Zhang;Hong Liao;Lutan Zhao;Yuncong Shao;Zhihong Tian;XiaoFeng Wang;Dan Meng;Rui Hou","doi":"10.1109/TPDS.2025.3581295","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3581295","url":null,"abstract":"Federated tree-based models are popular in many real-world applications owing to their high accuracy and good interpretability. However, the classical synchronous method causes inefficient federated tree-based model training due to tree node dependencies. Inspired by speculative execution techniques in modern high-performance processors, this paper proposes FTSeir, a novel and efficient speculative federated learning system. Instead of simply waiting, FTSeir optimistically predicts the outcome of the prior tree node. By resolving tree node dependencies with a neural network-based split point predictor, the training tasks of child tree nodes can be executed speculatively in advance via separate threads. This speculation enables cross-layer concurrent training, thus significantly reducing the waiting time. Furthermore, we propose an eager verification mechanism to promptly identify mispredictions, thereby reducing wasted computing resources. On a misprediction, an incomplete rollback is triggered for quick recovery by reusing the output of the mis-speculative training, which reduces computational requirements. We implement FTSeir and evaluate its efficiency in a real-world federated learning setting with six public datasets. 
Evaluation results demonstrate that FTSeir achieves up to 3.45× and 3.60× speedup over the state-of-the-art gradient boosted decision trees and random forests implementations, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1728-1743"},"PeriodicalIF":5.6,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Search With Improved Multi-Dimensional Dichotomy-Based Joint Optimization for Distributed Parallel Training of DNN","authors":"Guangyao Zhou;Yiqin Fu;Haocheng Lan;Yuanlun Xie;Wenhong Tian;Rajkumar Buyya;Jianhong Qian;Teng Su","doi":"10.1109/TPDS.2025.3580098","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3580098","url":null,"abstract":"Distributed parallel training of large-scale deep neural networks (DNNs) has attracted attention from both the artificial intelligence and high-performance distributed computing communities. One efficient approach is micro-batch-based pipeline parallelism (MBPP), e.g., GPipe and Terapipe. Based on MBPP, we establish a time-cost model built on per-layer time functions, which accounts for computing time and communication time simultaneously and models both as nonlinear in the amount of input data. Focusing on jointly optimal solutions for network division and data partition, we propose a Cross-Search algorithm with Improved Multi-dimensional Dichotomy (CSIMD). Through theoretical derivation, we prove that the improved multi-dimensional dichotomy (IMD) offers appreciable theoretical optimality and linear computational complexity, making it significantly faster than state-of-the-art methods including dynamic programming and recursive algorithms. Extensive experiments on both CNN-based and transformer-based neural networks demonstrate that our proposed CSIMD can obtain optimal network division and data partition schemes under MBPP. 
On average, the training speeds of CSIMD in CNN- and transformer-based DNNs are respectively <inline-formula><tex-math>$(2.0, 2.5)\times$</tex-math></inline-formula> and <inline-formula><tex-math>$(2.66, 5.48)\times$</tex-math></inline-formula> of (MBPP-R, MBPP-E).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1680-1694"},"PeriodicalIF":5.6,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Safe Multi-Agent Deep Reinforcement Learning for the Management of Autonomous Connected Vehicles at Future Intersections","authors":"Rui Zhao;Kui Wang;Yun Li;Yuze Fan;Fei Gao;Zhenhai Gao","doi":"10.1109/TPDS.2025.3580092","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3580092","url":null,"abstract":"As Connected and Autonomous Vehicles (CAVs) evolve, Autonomous Intersection Management (AIM) systems are emerging to enable safe, efficient traffic flow at urban intersections without traffic signals. However, existing AIM systems, whether based on traditional optimization control methods or machine learning, suffer from low computational efficiency and a lack of robustness in ensuring safety, respectively. To overcome these limitations, we propose an innovative AIM scheme rooted in Safe Multi-Agent Deep Reinforcement Learning (MADRL). We initially model the safe MADRL problem as a constrained Markov game (CMG) and tackle it with our multi-agent projective constrained policy optimization (MAPCPO). This method first optimizes policy updates within the Kullback-Leibler divergence trust region to maximize performance, and then projects these optimized policies onto the bounds of risk constraints, thus ensuring safety. Building on this, we introduce a Risk-Bounded RL for Autonomous Intersection Management (RbRL-AIM) algorithm. This algorithm adopts an architecture that consists of an LSTM-based policy neural network, a reward value network, and a risk neural network. These components, through the MAPCPO policy, enable continuous learning from complex and random intersection traffic environments, thereby facilitating the safe, efficient, and smooth control of vehicles at intersections. Our method is validated in a CARLA simulation, showing significant gains in computational and traffic efficiency over baseline optimization control methods. 
Compared to non-safety-aware MADRL methods, our approach achieves zero collisions and improved ride comfort.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1744-1761"},"PeriodicalIF":5.6,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Building Accurate and Interpretable Online Classifiers on Edge Devices","authors":"Yuanming Zhang;Pinghui Wang;Kuankuan Cheng;Junzhou Zhao;Jing Tao;Jingxin Hai;Junlan Feng;Chao Deng;Xidian Wang","doi":"10.1109/TPDS.2025.3579121","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3579121","url":null,"abstract":"By integrating machine learning with edge devices, we can augment the capabilities of edge devices, such as IoT devices, household appliances, and wearable technologies. These edge devices generally operate on microcontrollers with inherently limited resources, such as constrained RAM capacity and limited computational power. Nonetheless, they often process data in a high-velocity stream fashion, exemplified by sequences of activities and statuses monitored by advanced industrial sensors. In practical scenarios, models must be interpretable to facilitate troubleshooting and behavior understanding. Implementing machine learning models on edge devices is therefore valuable yet challenging, requiring a balance between model efficacy and resource constraints. To address this challenge, we introduce Onfesk, a novel framework that combines online learning algorithms with an innovative interpretable kernel. Specifically, Onfesk trains an online classifier over the kernel’s feature sketches. Benefiting from our specially designed modules, the kernel’s feature sketches can be efficiently produced, and the memory requirements of the classifier can be significantly reduced. As a result, Onfesk delivers effective and efficient performance in environments with limited resources without compromising on model interpretability. 
Extensive experiments with diverse real-world datasets have shown that Onfesk outperforms state-of-the-art methods, achieving up to a 7.4% improvement in accuracy within identical memory constraints.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1779-1796"},"PeriodicalIF":5.6,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VAHRM: Variation-Aware Resource Management in Heterogeneous Supercomputing Systems","authors":"Kohei Yoshida;Ryuichi Sakamoto;Kento Sato;Abhinav Bhatele;Hayato Yamaki;Hiroki Honda;Shinobu Miwa","doi":"10.1109/TPDS.2025.3577252","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3577252","url":null,"abstract":"In this article, we propose a novel resource management technique for heterogeneous supercomputing systems affected by manufacturing variability. Our proposed technique called VAHRM (Variation-Aware Heterogeneous Resource Management) takes a holistic approach to job scheduling on highly heterogeneous computing resources. VAHRM preferentially allocates energy-efficient computing resources to an energy-consuming job in a job queue, considering the impact on both the job turnaround time and the power consumption of individual resources. Furthermore, we have developed a novel approach to modeling the power consumption of computing resources that have manufacturing variability. Our approach called TSMVA (Two-Stage Modeling with Variation Awareness) enables us to generate the first variation-aware GPU power models, which can correctly estimate the power consumption of each GPU for a given job. 
Our experimental results show that, compared to conventional first-come-first-serve (FCFS) and state-of-the-art variation-aware scheduling algorithms, VAHRM can achieve improvements in system energy efficiency of up to 5.8% and 5.4% (4.5% and 4.2% on average) while reducing the average turnaround time by 21.2% and 11.9%, respectively, for various workloads obtained from a production system.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1713-1727"},"PeriodicalIF":5.6,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11031465","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144524424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Resource-Constrained Federated Learning Systems With Guessed Updates","authors":"Mohamed Yassine Boukhari;Akash Dhasade;Anne-Marie Kermarrec;Rafael Pires;Othmane Safsafi;Rishi Sharma","doi":"10.1109/TPDS.2025.3578522","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3578522","url":null,"abstract":"Federated learning (FL) enables a set of client devices to collaboratively train a model without sharing raw data. This process, though, operates under the constrained computation and communication resources of edge devices. These constraints combined with systems heterogeneity force some participating clients to perform fewer local updates than expected by the server, thus slowing down convergence. Exhaustive tuning of hyperparameters in FL, furthermore, can be resource-intensive, without which the convergence is adversely affected. In this work, we propose <sc>GeL</sc>, the guess and learn algorithm. <sc>GeL</sc> enables constrained edge devices to perform additional learning through guessed updates on top of gradient-based steps. These guesses are <italic>gradientless</italic>, i.e., participating clients leverage them <italic>for free</italic>. Our generic guessing algorithm (i) can be flexibly combined with several state-of-the-art algorithms including <sc>FedProx</sc>, <sc>FedNova</sc>, <sc>FedYogi</sc> or <sc>ScaleFL</sc>; and (ii) achieves significantly improved performance when the learning rates are not best tuned. 
We conduct extensive experiments and show that <sc>GeL</sc> can boost empirical convergence by up to 40% in resource-constrained networks while relieving the need for exhaustive learning rate tuning.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1666-1679"},"PeriodicalIF":5.6,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ISACPP: Interference-Aware Scheduling Approach for Deep Learning Training Workloads Based on Co-Location Performance Prediction","authors":"Zijie Liu;Yi Cheng;Can Chen;Jun Hu;Rongguo Fu;Dengyin Zhang","doi":"10.1109/TPDS.2025.3577796","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3577796","url":null,"abstract":"Traditional exclusive cloud resource allocation for deep learning training (DLT) workloads is unsuitable for advanced GPU infrastructure, leading to resource under-utilization. Fortunately, DLT workload co-location provides a promising way to improve resource utilization. However, existing workload co-location methods fail to accurately quantify interference among DLT workloads, resulting in performance degradation. To address this problem, this article proposes an interference-aware scheduling approach for DLT workloads based on co-location performance prediction, dubbed ‘ISACPP’. ISACPP first builds an edge-fusion gated graph attention network (E-GGAT) that incorporates DL model structures, underlying GPU types, and hyper-parameter settings to predict co-location performance. Since the co-location state changes as each workload is completed, ISACPP proposes a multi-stage co-location interference quantification model derived from the predicted co-location performance to identify the GPU device with the minimum overall interference. Experimental results demonstrate that ISACPP can accurately estimate the co-location performance of DLT workloads with a maximum prediction error of 8.72%, 1.9%, and 4.4% for execution time, GPU memory consumption, and GPU utilization, respectively. 
Meanwhile, ISACPP can significantly shorten workload makespan by up to 34.9% compared to state-of-the-art interference-aware scheduling methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1591-1607"},"PeriodicalIF":5.6,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144323113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Efficiency and Decentralization: A Blockchain Assisted Distributed Fuzzy-Rough Feature Selection","authors":"Lin Qiu;Xingwei Wang;Bo Yi;Kaimin Zhang;Fei Gao;Min Huang;Yanpeng Qu","doi":"10.1109/TPDS.2025.3578032","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3578032","url":null,"abstract":"Fuzzy-rough sets-based feature selection (FRFS), as an effective data pre-processing technique, has drawn significant attention with the growing prevalence of large-scale datasets. However, centralized FRFS approaches suffer from the following shortcomings: 1) low computational efficiency, 2) bottlenecks in memory and computational resources, and 3) strict limits on collaborative implementation using non-shared datasets owned by different data providers. These limitations highlight the growing necessity of integrating FRFS into a distributed FS framework. Nevertheless, most existing distributed FS schemes rely on a designated central server to collect and merge the local results from all slave nodes, which may result in several challenges, including a single-point-of-failure risk, lack of trust and reliability, and lack of transparency and traceability. To address these issues, this paper proposes a blockchain-assisted distributed FS framework, successfully implementing a distributed solution for FRFS (BDFRFS). First, this framework introduces a blockchain to merge, reach consensus on, and publish the global results generated during each iteration of FRFS, including the currently selected feature subset with its corresponding similarity matrix and dependency degree. This not only eliminates the reliance on a central server, but also enhances the credibility and traceability of the results. 
Additionally, the implementation of FRFS is designed within this framework, utilizing three strategies to improve the efficiency of centralized FRFS: 1) eliminating the irrelevant and redundant features prior to executing FRFS; 2) removing redundant and unnecessary computations involved in generating the similarity matrices; and 3) enabling parallel computation of dependency degrees. Finally, experiments conducted on eight large-scale datasets demonstrate that the proposed framework can significantly reduce the runtime cost and improve the classification accuracy compared to centralized FRFS and several distributed FS approaches.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1762-1778"},"PeriodicalIF":5.6,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SSS-DIMM: Removing Redundant Data Movement in Trusted DIMM-Based Near-Memory-Processing Kernel Offloading via Secure Space Sharing","authors":"Weiyi Sun;Jianfeng Zhu;Mingyu Gao;Zhaoshi Li;Shaojun Wei;Leibo Liu","doi":"10.1109/TPDS.2025.3576438","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3576438","url":null,"abstract":"DIMM-based Near-Memory-Processing (NMP) kernel offloading enables a program to execute in computation-enabled DIMM buffer chips, bypassing the bandwidth-constrained CPU main memory bus for high performance. Yet, it also enables programs to access memory without restrictions and protection from CPU, resulting in potential security hazards. To protect general NMP kernel offloading even with malicious privileged software, a heterogeneous TEE is required. However, for architectural design simplification, the conventional heterogeneous TEE design isolates host CPU process from NMP kernel’s memory and vice versa, such that CPU TEE and trusted NMP driver can protect CPU processes and NMP kernels in complete separation. Such isolation results in redundant input/output data movement between the two isolated memory spaces, with half of the movement performed by host CPU. Worsened by limited CPU memory bandwidth, we identify that such redundancy severely bottlenecks the performance of many potential NMP applications. To overcome this bottleneck, we propose to abandon isolation and share the NMP kernel memory with its host CPU process. Based on this idea, we design <underline>SSS-DIMM</underline>, an efficient TEE for DIMM-based NMP kernel offloading that removes the redundant data movement via <underline>S</underline>ecure <underline>S</underline>pace <underline>S</underline>haring. 
SSS-DIMM resolves the two security challenges raised by memory sharing: providing consistent security guarantees for CPU processes and NMP kernels, via the CPU TEE and the NMP driver, over both memory ownership (allocation) and views (mapping); and ensuring that cryptographic metadata is securely shared and synchronized between the CPU and the NMP unit. Our evaluation shows that SSS-DIMM maintains both security and high performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 8","pages":"1810-1827"},"PeriodicalIF":5.6,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}