{"title":"SSA: A Uniformly Recursive Bidirection-Sequence Systolic Sorter Array","authors":"Teng Gao;Lan Huang;Shang Gao;Kangping Wang","doi":"10.1109/TPDS.2024.3434332","DOIUrl":"10.1109/TPDS.2024.3434332","url":null,"abstract":"The use of reconfigurable circuits with parallel computing capabilities has been explored to enhance sorting performance and reduce power consumption. Nonetheless, most sorting algorithms utilizing dedicated processors are designed solely based on the parallelization of the algorithm, lacking considerations of specialized hardware structures. This leads to problems, including but not limited to the consumption of excessive I/O interface resources, on-chip storage resources, and complex layout wiring. In this paper, we propose a Systolic Sorter Array, implemented by a Uniform Recurrence Equation (URE) with highly parameterised in terms of data size, bit width and type. Leveraging this uniformly recursive structure, the sorter can simultaneously sort two independent sequences. In addition, we implemented global and local control modes on the FPGA to achieve higher computational frequencies. In our experiments, we have demonstrated the speed-up ratio of SSA relative to other state of the art (SOTA) sorting algorithms using C++ \u0000<inline-formula><tex-math>$std$</tex-math></inline-formula>\u0000::\u0000<inline-formula><tex-math>$sort()$</tex-math></inline-formula>\u0000 as benchmark. Inheriting the benefits from the Systolic Array architecture, the SSA reaches up to 810 Mhz computing frequency on the U200. The results of our study show that SSA outperforms other sorting algorithms in terms of throughput, speed-up ratio, and computation frequency.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1721-1734"},"PeriodicalIF":5.6,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Long-Range MD Electrostatics Force Computation on FPGAs","authors":"Sahan Bandara;Anthony Ducimo;Chunshu Wu;Martin Herbordt","doi":"10.1109/TPDS.2024.3434347","DOIUrl":"10.1109/TPDS.2024.3434347","url":null,"abstract":"Strong scaling of long-range electrostatic force computation, which is a central concern of long timescale molecular dynamics simulations, is challenging for CPUs and GPUs due to its complex communication structure and global communication requirements. The scalability challenge is seen especially in small simulations of tens to hundreds of thousands of atoms that are of interest to many important applications such as physics-driven drug discovery. FPGA clusters, with their direct, tightly coupled, low-latency interconnects, are able to address these requirements. For FPGA MD clusters to be effective, however, single device performance must also be competitive. In this work, we leverage the inherent benefits of FPGAs to implement a long-range electrostatic force computation architecture. We present an overall framework with numerous algorithmic, mapping, and architecture innovations, including a unified interleaved memory, a spatial scheduling algorithm, and a design for seamless integration with the larger MD system. We examine a number of alternative configurations based on different resource allocation strategies and user parameters. We show that the best configuration of this architecture, implemented on an Intel Agilex FPGA, can achieve \u0000<inline-formula><tex-math>$2124 ns$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$287 ns$</tex-math></inline-formula>\u0000 of simulated time per day of wall-clock time for the two molecular dynamics benchmarks DHFR and ApoA1; simulating 23K and 92K particles, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1690-1707"},"PeriodicalIF":5.6,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Redundancy-Free and Load-Balanced TGNN Training With Hierarchical Pipeline Parallelism","authors":"Yaqi Xia;Zheng Zhang;Donglin Yang;Chuang Hu;Xiaobo Zhou;Hongyang Chen;Qianlong Sang;Dazhao Cheng","doi":"10.1109/TPDS.2024.3432855","DOIUrl":"10.1109/TPDS.2024.3432855","url":null,"abstract":"Recently, Temporal Graph Neural Networks (TGNNs), as an extension of Graph Neural Networks, have demonstrated remarkable effectiveness in handling dynamic graph data. Distributed TGNN training requires efficiently tackling temporal dependency, which often leads to excessive cross-device communication that generates significant redundant data. However, existing systems are unable to remove the redundancy in data reuse and transfer, and suffer from severe communication overhead in a distributed setting. This work introduces Sven, a co-designed algorithm-system library aimed at accelerating TGNN training on a multi-GPU platform. Exploiting dependency patterns of TGNN models, we develop a redundancy-free graph organization to mitigate redundant data transfer. Additionally, we investigate communication imbalance issues among devices and formulate the graph partitioning problem as minimizing the maximum communication balance cost, which is proved to be an NP-hard problem. We propose an approximation algorithm called Re-FlexBiCut to tackle this problem. Furthermore, we incorporate prefetching, adaptive micro-batch pipelining, and asynchronous pipelining to present a hierarchical pipelining mechanism that mitigates the communication overhead. Sven represents the first comprehensive optimization solution for scaling memory-based TGNN training. Through extensive experiments conducted on a 64-GPU cluster, Sven demonstrates impressive speedup, ranging from 1.9x to 3.5x, compared to State-of-the-Art approaches. Additionally, Sven achieves up to 5.26x higher communication efficiency and reduces communication imbalance by up to 59.2%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"1904-1919"},"PeriodicalIF":5.6,"publicationDate":"2024-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs","authors":"Cunyang Wei;Haipeng Jia;Yunquan Zhang;Jianyu Yao;Chendi Li;Wenxuan Cao","doi":"10.1109/TPDS.2024.3432579","DOIUrl":"10.1109/TPDS.2024.3432579","url":null,"abstract":"The matrix multiplication algorithm is a fundamental numerical technique in linear algebra and plays a crucial role in many scientific computing applications. Despite the high performance of mainstream basic linear algebra libraries for large-scale dense matrix multiplications, they exhibit poor performance when applied to matrix multiplication with irregular input. This paper proposes an input-aware tuning framework that accounts for application scenarios and computer architectures to provide high-performance irregular matrix multiplication on ARMv8 and X86 CPUs. The framework comprises two stages: the install-time stage and the run-time stage. The install-time stage utilizes our proposed computational template to generate high-performance kernels for general data layout and SIMD-friendly data layout. The run-time stage utilizes a tiling algorithm suitable for irregular GEMM to select the optimal kernel and link as an execution plan. Additionally, load-balanced multi-threaded optimization algorithms are defined to exploit the multi-threading capability of modern processors. Experiments demonstrate that the proposed IrGEMM framework can achieve significant performance improvements for irregular GEMM on both ARMv8 and X86 CPUs compared to other mainstream BLAS libraries.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1672-1689"},"PeriodicalIF":5.6,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sophisticated Orchestrating Concurrent DLRM Training on CPU/GPU Platform","authors":"Rui Tian;Jiazhi Jiang;Jiangsu Du;Dan Huang;Yutong Lu","doi":"10.1109/TPDS.2024.3432620","DOIUrl":"10.1109/TPDS.2024.3432620","url":null,"abstract":"Recommendation systems are essential to the operation of the majority of internet services, with Deep Learning Recommendation Models (DLRMs) serving as a crucial component. However, due to distinct computation, data access, and memory usage characteristics of recommendation models, the trainning of DLRMs may suffer from low resource utilization on prevalent heterogeneous CPU-GPU hardware platforms. Furthermore, as the majority of high-performance computing systems presently depend on multi-GPU computing nodes, the challenge of addressing low resource utilization becomes even more pronounced. Existing concurrent training solutions cannot be straightforwardly applied to DLRM due to various factors, such as insufficient fine-grained memory management and the lack of collaborative CPU-GPU scheduling. In this paper, we introduce RMixer, a scheduling framework that addresses these challenges by providing an efficient job management and scheduling mechanism for DLRM training jobs on heterogeneous CPU-GPU platforms. To facilitate training co-location, we first estimate the peak memory consumption of each job. Additionally, we track and collect resource utilization for DLRM training jobs. Based on the information of computational patterns, a batched job dispatcher with dynamic resource-complementary scheduling policy is proposed to co-locate DLRM training jobs on CPU-GPU platform. Scheduling strategies for both intra-GPU and inter-GPU scenarios were meticulously devised, with a focus on thoroughly examining individual GPU resource utilization and achieving a balanced state across multiple GPUs. Experimental results demonstrate that our implementation achieved up to 5.3× and 7.5× higher throughput on single GPU and 4 GPU respectively for training jobs involving various recommendation models.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2177-2192"},"PeriodicalIF":5.6,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepTM: Efficient Tensor Management in Heterogeneous Memory for DNN Training","authors":"Haoran Zhou;Wei Rang;Hongyang Chen;Xiaobo Zhou;Dazhao Cheng","doi":"10.1109/TPDS.2024.3431910","DOIUrl":"10.1109/TPDS.2024.3431910","url":null,"abstract":"Deep Neural Networks (DNNs) have gained widespread adoption in diverse fields, including image classification, object detection, and natural language processing. However, training large-scale DNN models often encounters significant memory bottlenecks, which ask for efficient management of extensive tensors. Heterogeneous memory system, which combines persistent memory (PM) modules with traditional DRAM, offers an economically viable solution to address tensor management challenges during DNN training. However, existing memory management methods on heterogeneous memory systems often lead to low PM access efficiency, low bandwidth utilization, and incomplete analysis of model characteristics. To overcome these hurdles, we introduce an efficient tensor management approach, DeepTM, tailored for heterogeneous memory to alleviate memory bottlenecks during DNN training. DeepTM employs page-level tensor aggregation to enhance PM read and write performance and executes contiguous page migration to increase memory bandwidth. Through an analysis of tensor access patterns and model characteristics, we quantify the overall performance and transform the performance optimization problem into the framework of Integer Linear Programming. Additionally, we achieve tensor heat recognition by dynamically adjusting the weights of four key tensor characteristics and develop a global optimization strategy using Deep Reinforcement Learning. To validate the efficacy of our approach, we implement and evaluate DeepTM, utilizing the TensorFlow framework running on a PM-based heterogeneous memory system. The experimental results demonstrate that DeepTM achieves performance improvements of up to 36% and 49% compared to the current state-of-the-art memory management strategies AutoTM and Sentinel, respectively. Furthermore, our solution reduces the overhead by 18 times and achieves up to 29% cost reduction compared to AutoTM.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"1920-1935"},"PeriodicalIF":5.6,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Springald: GPU-Accelerated Window-Based Aggregates Over Out-of-Order Data Streams","authors":"Gabriele Mencagli;Patrizio Dazzi;Massimo Coppola","doi":"10.1109/TPDS.2024.3431611","DOIUrl":"10.1109/TPDS.2024.3431611","url":null,"abstract":"An increasing number of application domains require high-throughput processing to extract insights from massive data streams. The Data Stream Processing (DSP) paradigm provides formal approaches to analyze structured data streams considered as special, unbounded relations. The most used class of stateful operators in DSP are the ones running sliding-window aggregation, which continuously extracts insights from the most recent portion of the stream. This article presents \u0000<sc>Springald</small>\u0000, an efficient sliding-window operator leveraging GPU devices. \u0000<sc>Springald</small>\u0000, incorporated in the \u0000<sc>WindFlow</small>\u0000 parallel library, processes out-of-order data streams with watermarks propagation. These two features—GPU processing and out-of-orderliness—make \u0000<sc>Springald</small>\u0000 a novel contribution to this research area. This article describes the methodology behind \u0000<sc>Springald</small>\u0000, its design and implementation. We also provide an extensive experimental evaluation to understand the behavior of \u0000<sc>Springald</small>\u0000 deeply, and we showcase its superior performance against state-of-the-art competitors.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 9","pages":"1657-1671"},"PeriodicalIF":5.6,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10606093","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141772292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IRIS: A Performance-Portable Framework for Cross-Platform Heterogeneous Computing","authors":"Jungwon Kim;Seyong Lee;Beau Johnston;Jeffrey S. Vetter","doi":"10.1109/TPDS.2024.3429010","DOIUrl":"10.1109/TPDS.2024.3429010","url":null,"abstract":"From edge to exascale, computer architectures are becoming more heterogeneous and complex. The systems typically have fat nodes, with multicore CPUs and multiple hardware accelerators such as GPUs, FPGAs, and DSPs. This complexity is causing a crisis in programming systems and performance portability. Several programming systems are working to address these challenges, but the increasing architectural diversity is forcing software stacks and applications to be specialized for each architecture. As we show, all of these approaches critically depend on their software framework for discovery, execution, scheduling, and data orchestration. To address this challenge, we believe that a more agile and proactive software framework is essential to increase performance portability and improve user productivity. To this end, we have designed and implemented IRIS: a performance-portable framework for cross-platform heterogeneous computing. IRIS can discover available resources, manage multiple diverse programming platforms (e.g., CUDA, Hexagon, HIP, Level Zero, OpenCL, OpenMP) simultaneously in the same execution, respect data dependencies, orchestrate data movement proactively, and provide for user-configurable scheduling. To simplify data movement, IRIS introduces a shared virtual device memory with relaxed consistency among different heterogeneous devices. IRIS also adds an automatic kernel workload partitioning technique using the polyhedral model so that it can resize kernels for a wide range of devices. Our evaluation on three architectures, ranging from Qualcomm Snapdragon to a Summit supercomputer node, shows that IRIS improves portability across a wide range of diverse heterogeneous architectures with negligible overhead.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1796-1809"},"PeriodicalIF":5.6,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG","authors":"Jiaxing Qi;Wencong Xiao;Mingzhen Li;Chaojie Yang;Yong Li;Wei Lin;Hailong Yang;Zhongzhi Luan;Depei Qian","doi":"10.1109/TPDS.2024.3431189","DOIUrl":"10.1109/TPDS.2024.3431189","url":null,"abstract":"As deep learning (DL) technologies become ubiquitous, GPU clusters are deployed for inference tasks with consistent service level objectives (SLOs). Efficiently utilizing multiple GPUs is crucial for throughput and cost-effectiveness. This article addresses the challenges posed by dynamic input and NVIDIA MIG in scheduling DL workloads. We present ElasticBatch, a scheduling system that simplifies configuration through bucketization and employs a machine learning-based pipeline to optimize settings. Our experiments demonstrate that ElasticBatch achieves a 50% reduction in GPU instances compared to MIG disablement, increases GPU utilization by 1.4% to 6.5% over an ideal scheduler and significantly reduces profiling time. This research contributes to the discourse on efficient utilization of GPU clusters. ElasticBatch's effectiveness in mitigating challenges posed by dynamic inputs and NVIDIA MIG underscores its potential to optimize GPU cluster performance, providing tangible benefits in terms of reduced instances, increased utilization, and significant time savings in real-world deployment scenarios.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 10","pages":"1708-1720"},"PeriodicalIF":5.6,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141743578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}