{"title":"SMore: Enhancing GPU Utilization in Deep Learning Clusters by Serverless-Based Co-Location Scheduling","authors":"Junhan Liu;Zinuo Cai;Yumou Liu;Hao Li;Zongpu Zhang;Ruhui Ma;Rajkumar Buyya","doi":"10.1109/TPDS.2025.3548320","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3548320","url":null,"abstract":"Deep learning (DL) clusters allow machine learning practitioners to submit their computation-intensive tasks, with GPUs accelerating their execution. However, GPUs in current deep learning clusters are often under-utilized, which hampers job performance and overall cluster throughput. Improving GPU utilization is urgent, but existing works lack research on fine-grained GPU resource allocation, as they typically allocate GPUs as indivisible units. Serverless computing reveals an opportunity to optimize utilization with fine-grained resource allocation, but it requires addressing three main challenges: co-location performance degradation, guaranteeing the service-level objectives of serverless functions, and cold start overhead. We propose <sc>SMore</sc>, a framework based on serverless computing to optimize the GPU resource utilization of DL clusters. <sc>SMore</sc> dynamically predicts the possible co-location performance degradation and leverages a degradation-aware scheduling algorithm to ensure that co-location decisions do not impact workload performance. It also dynamically preloads or offloads DL models by predicting the number of requests in the subsequent period to address the cold start issue. Through real-trace testing on the <sc>SMore</sc> prototype, we find that average GPU utilization can be increased by 34% while degradation is controlled effectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"903-917"},"PeriodicalIF":5.6,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PimBeam: Efficient Regular Path Queries Over Graph Database Using Processing-in-Memory","authors":"Weihan Kong;Shengan Zheng;Yifan Hua;Ruoyan Ma;Yuheng Wen;Guifeng Wang;Cong Zhou;Linpeng Huang","doi":"10.1109/TPDS.2025.3547365","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3547365","url":null,"abstract":"Regular path queries (RPQs) in graph databases are bottlenecked by the memory wall. Emerging processing-in-memory (PIM) technologies offer a promising solution by dispatching and executing path matching tasks in parallel within PIM modules. We present an efficient PIM-based data management system tailored for RPQs and graph updates. Our solution, called PimBeam, facilitates efficient batch RPQs and graph updates by implementing a PIM-friendly dynamic graph partitioning algorithm. This algorithm effectively addresses graph skewness while maintaining graph locality with low overhead for handling RPQs. PimBeam streamlines label filtering queries by adding a filtering module on the PIM side and leveraging the parallelism of PIM. For graph updates, PimBeam enhances processing efficiency by amortizing the host CPU's update overhead across PIM modules. Evaluation results indicate that PimBeam achieves an average 3.59x speedup for RPQs and a 29.33x speedup for graph updates over a state-of-the-art traditional graph database.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"1042-1057"},"PeriodicalIF":5.6,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Load-Balanced Redundancy Transitioning for Erasure-Coded Storage","authors":"Keyun Cheng;Huancheng Puyang;Xiaolu Li;Patrick P. C. Lee;Yuchong Hu;Jie Li;Ting-Yi Wu","doi":"10.1109/TPDS.2025.3547872","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3547872","url":null,"abstract":"Redundancy transitioning enables erasure-coded storage to adapt to varying performance and reliability requirements by re-encoding data with new coding parameters on-the-fly. Existing studies focus on bandwidth-driven redundancy transitioning that reduces the transitioning bandwidth across storage nodes, yet the actual redundancy transitioning performance remains bottlenecked by the most loaded node. We present BART, a load-balanced redundancy transitioning scheme that aims to reduce the redundancy transitioning time via carefully scheduled parallelization. We show that finding an optimal load-balanced solution is difficult due to the large solution space. Given this challenge, BART decomposes the redundancy transitioning problem into multiple sub-problems and solves them via efficient heuristics. We evaluate BART using both simulations for large-scale storage and HDFS prototype experiments on Alibaba Cloud. We show that BART significantly reduces the redundancy transitioning time compared with the bandwidth-driven approach.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"889-902"},"PeriodicalIF":5.6,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Communication-Efficient Out-of-Core Graph Processing on the GPU","authors":"Qiange Wang;Xin Ai;Yongze Yan;Shufeng Gong;Yanfeng Zhang;Jing Chen;Ge Yu","doi":"10.1109/TPDS.2025.3547356","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3547356","url":null,"abstract":"The key performance bottleneck of large-scale graph processing on memory-limited GPUs is host-GPU graph data transfer. Existing GPU-accelerated graph processing frameworks address this issue by managing the active subgraph transfer at runtime. Some frameworks adopt explicit transfer management approaches based on explicit memory copy with filtering or compaction, while others adopt implicit transfer management approaches based on on-demand access through the zero-copy mechanism or unified virtual memory. Through intensive analysis, we find that, as the active vertices evolve, the performance of the two approaches varies across workloads. Due to heavy redundant data transfers, high CPU compaction overhead, or low bandwidth utilization, adopting a single approach often results in suboptimal performance. Moreover, these methods lack effective cache management to address the irregular and sparse memory access patterns of graph processing. In this work, we propose a hybrid transfer management approach that combines the merits of both transfer approaches at runtime. Moreover, we present an efficient vertex-centric graph caching framework that minimizes CPU-GPU communication by caching frequently accessed graph data at runtime. Based on these techniques, we present HytGraph, a GPU-accelerated graph processing framework empowered by a set of effective task-scheduling optimizations to improve performance. Experiments on real-world and synthetic graphs show that HytGraph achieves average speedups of 2.5×, 5.0×, and 2.0× over the state-of-the-art GPU-accelerated graph processing systems Grus, Subway, and EMOGI, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"961-976"},"PeriodicalIF":5.6,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Publicly Verifiable Distributed Computation for MEC Setting","authors":"Qiang Wang;Zhicheng Li;Fucai Zhou;Jian Xu;Changsheng Zhang","doi":"10.1109/TPDS.2025.3566080","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3566080","url":null,"abstract":"With the rapid expansion of the Internet of Things (IoT), the shift from cloud computing to Mobile Edge Computing (MEC) has become necessary to address the low-latency requirements of real-time applications. Verifiable computation (VC) enables resource-limited clients to outsource their computation-intensive tasks to a powerful cloud while ensuring the correctness of the computation result. However, traditional VC schemes, originally designed for cloud computing, face challenges when applied to MEC environments, such as scalability, robustness, and efficiency concerns. To this end, we propose a verifiable distributed computation scheme for MEC, where computation tasks are distributed between a cloud server cluster (consisting of <inline-formula><tex-math>$n$</tex-math></inline-formula> servers) and an edge server. The cloud handles most of the computation through parallel sub-tasks, while the edge server verifies intermediate results and performs minimal computation to recover the final outcome. Our scheme guarantees that the result can be recovered if at least <inline-formula><tex-math>$t$</tex-math></inline-formula> of the <inline-formula><tex-math>$n$</tex-math></inline-formula> servers in the cloud server cluster perform their computations honestly. By leveraging batch verification and matrix-optimized polynomial evaluations, our scheme significantly enhances scalability, fault tolerance, and efficiency. Extensive analysis and simulations demonstrate that our proposed scheme is more feasible than existing solutions.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1416-1430"},"PeriodicalIF":5.6,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144171001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identifying Performance Inefficiencies of Parallel Program With Spatial and Temporal Trace Analysis","authors":"Zhibo Xuan;Xin Sun;Xin You;Hailong Yang;Zhongzhi Luan;Yi Liu;Depei Qian","doi":"10.1109/TPDS.2025.3566735","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3566735","url":null,"abstract":"Performance inefficiencies can lead to performance anomalies in parallel programs. Existing performance analysis tools either have a limited detection scope or require significant domain knowledge to use, which constrains their practical adoption for identifying performance inefficiencies. In this paper, we propose <italic>STAD</italic>, a performance analysis tool for parallel programs that considers both spatial and temporal patterns within trace data. <italic>STAD</italic> captures the spatial communication patterns between processes using a spatial communication pattern graph. It then adopts a dynamic graph neural network-based unsupervised model to learn the evolving temporal patterns along the timeline. Additionally, <italic>STAD</italic> diagnoses the root causes of performance anomalies by exploiting the aggregated features of anomalies along the call tree. Our evaluation results demonstrate that <italic>STAD</italic> can effectively detect performance anomalies with acceptable overhead and diagnose root causes attributed to both the program itself and the running environment.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 7","pages":"1387-1400"},"PeriodicalIF":5.6,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144100076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IoT-Dedup: Device Relationship-Based IoT Data Deduplication Scheme","authors":"Yuan Gao;Liquan Chen;Jianchang Lai;Tianyi Wang;Xiaoming Wu;Shui Yu","doi":"10.1109/TPDS.2025.3544315","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3544315","url":null,"abstract":"The cyclical and continuous working characteristics of <italic>Internet of Things</italic> (<italic>IoT</italic>) devices generate large amounts of identical or similar data, which consumes significant storage space. To solve this problem, various secure data deduplication schemes have been proposed. However, existing schemes deduplicate only by data similarity, ignoring the internal connections among devices, which makes them not directly applicable to parallel and distributed scenarios like IoT. Furthermore, secure data deduplication leads to multiple users sharing the same encryption key, which may cause security issues. To this end, we propose a device relationship-based IoT data deduplication scheme that fully considers IoT data characteristics and the internal connections among devices. Specifically, we propose a device relationship prediction approach that obtains device collaborative relationships by clustering the topology of their communication graph, and classifies data types based on device relationships to achieve data deduplication with different security levels. We then design a similarity-preserving encryption algorithm in which the security level of the encryption key is determined by the data type, ensuring the security of the deduplicated data. In addition, we design two data deduplication methods, identical deduplication and similar deduplication, to meet the privacy requirements of different data types, improving deduplication efficiency while preserving data privacy as much as possible. We evaluate the performance of our scheme using five real datasets, and the results show that our scheme performs favorably in terms of both deduplication performance and computational cost.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"847-860"},"PeriodicalIF":5.6,"publicationDate":"2025-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Courier: A Unified Communication Agent to Support Concurrent Flow Scheduling in Cluster Computing","authors":"Zhaochen Zhang;Xu Zhang;Zhaoxiang Bao;Liang Wei;Chaohong Tan;Wanchun Dou;Guihai Chen;Chen Tian","doi":"10.1109/TPDS.2025.3543882","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3543882","url":null,"abstract":"As one of the pillars of cluster computing frameworks, coflow scheduling algorithms can effectively shorten the network transmission time of cluster computing jobs, thus reducing job completion times and improving execution performance. However, most existing coflow scheduling algorithms fail to consider the influence of concurrent flows, which can degrade their performance under a massive number of concurrent flows. To fill this gap, we propose a unified communication agent named Courier that minimizes the number of concurrent flows in cluster computing applications and is compatible with mainstream coflow scheduling approaches. To maintain the scheduling order given by the scheduling algorithms, Courier merges multiple flows between each pair of hosts into a unified flow and determines its order based on that of the original flows. In addition, to adapt to various types of topologies, Courier introduces a control mechanism that adjusts the number of flows while maintaining the scheduling order. Extensive large-scale trace-driven simulations show that Courier is compatible with existing scheduling algorithms and outperforms the state-of-the-art approaches by about 30% under a variety of workloads and topologies.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"861-876"},"PeriodicalIF":5.6,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143821656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spread+: Scalable Model Aggregation in Federated Learning With Non-IID Data","authors":"Huanghuang Liang;Xin Yang;Xiaoming Han;Boan Liu;Chuang Hu;Dan Wang;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TPDS.2025.3539738","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3539738","url":null,"abstract":"Federated learning (FL) addresses privacy concerns by training models without sharing raw data, overcoming the limitations of traditional machine learning paradigms. However, the rise of smart applications has accentuated the heterogeneity in data and devices, which presents significant challenges for FL. In particular, data skewness among participants can compromise model accuracy, while diverse device capabilities lead to aggregation bottlenecks, causing severe model congestion. In this article, we introduce Spread+, a hierarchical system that enhances FL by organizing clients into clusters and delegating model aggregation to edge devices, thus mitigating these challenges. Spread+ leverages a hedonic coalition formation game to optimize client organization and adaptive algorithms to regulate aggregation intervals within and across clusters. Moreover, it refines the aggregation algorithm to boost model accuracy. Our experiments demonstrate that Spread+ significantly alleviates the central aggregation bottleneck and surpasses mainstream benchmarks, achieving performance improvements of 49.58% over FAVG and 22.78% over Ring-allreduce.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 4","pages":"701-716"},"PeriodicalIF":5.6,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143535516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Libfork: Portable Continuation-Stealing With Stackless Coroutines","authors":"Conor J. Williams;James Elliott","doi":"10.1109/TPDS.2025.3543442","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3543442","url":null,"abstract":"Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time-scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation-stealing in traditional High Performance Computing (HPC) languages, where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully-portable continuation stealing and present <i>libfork</i>, a wait-free fine-grained parallelism library combining coroutines with user-space, geometric segmented stacks. We show our approach is able to achieve optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to OpenMP (libomp), libfork is on average <inline-formula><tex-math>$7.2\\times$</tex-math></inline-formula> faster and consumes <inline-formula><tex-math>$10\\times$</tex-math></inline-formula> less memory. Similarly, compared to Intel's TBB, libfork is on average <inline-formula><tex-math>$2.7\\times$</tex-math></inline-formula> faster and consumes <inline-formula><tex-math>$6.2\\times$</tex-math></inline-formula> less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching <i>busy-waiting</i> schedulers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 5","pages":"877-888"},"PeriodicalIF":5.6,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143808951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}