IEEE Transactions on Parallel and Distributed Systems: Latest Articles

Trusted Model Aggregation With Zero-Knowledge Proofs in Federated Learning
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-09-06, DOI: 10.1109/TPDS.2024.3455762
Renwen Ma; Kai Hwang; Mo Li; Yiming Miao
Abstract: This paper proposes a new global model aggregation method based on zero-knowledge federated learning (ZKFL). The purpose is to secure horizontal or P2P federated machine learning systems with shorter aggregation times, higher model accuracy, and lower system costs. We use a model parameter-sharing Chord overlay network among all client hosts. The overlay guarantees trusted sharing of zero-knowledge proofs for aggregation integrity, even under malicious Byzantine attacks. We tested on the popular Fashion-MNIST and CIFAR10 datasets to prove the new system protection concept. Our benchmark experiments validate the claimed advantages of the ZKFL scheme on all objective functions. Our aggregation method can be applied to secure both rank-based and similarity-based aggregation schemes. For a large system with over 200 clients, our system takes only 3 seconds to yield high-precision global machine models under ALIE attacks with the Fashion-MNIST dataset. We achieved up to 85% model accuracy, compared with only 3%~45% accuracy observed with unprotected federated schemes. Moreover, our method demands a low memory overhead for handling zero-knowledge proofs as the system scales to a larger number of client nodes.
Vol. 35, no. 11, pp. 2284-2296.
Citations: 0
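The Chord overlay mentioned in the abstract assigns each key to its successor node on a hashed ring. A minimal sketch of that generic placement rule follows; the 16-bit ring size, the key names, and any tie to the actual ZKFL protocol are illustrative assumptions, not details from the paper.

```python
import hashlib

def chord_id(key, m=16):
    """Hash a key onto a Chord ring of 2^m positions."""
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return digest % (1 << m)

def successor(node_ids, key_id):
    """The node responsible for key_id is the first node at or after it
    moving clockwise around the ring (wrapping past zero)."""
    for nid in sorted(node_ids):
        if nid >= key_id:
            return nid
    return min(node_ids)  # wrapped past the largest id

# Hypothetical: five client hosts, one proof share to place.
nodes = [chord_id(f"client-{i}") for i in range(5)]
owner = successor(nodes, chord_id("zk-proof:round-42"))
print(owner in nodes)  # → True
```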
FedVeca: Federated Vectorized Averaging on Non-IID Data With Adaptive Bi-Directional Global Objective
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-09-04, DOI: 10.1109/TPDS.2024.3454203
Ping Luo; Jieren Cheng; N. Xiong; Zhenhao Liu; Jie Wu
Abstract: Federated Learning (FL) is a distributed machine learning framework for parallel and distributed systems. However, Non-Independent and Identically Distributed (Non-IID) data negatively affects communication efficiency, since clients with different datasets may produce significantly divergent local gradients in each communication round. In this article, we propose a Federated Vectorized Averaging (FedVeca) method to optimize FL communication on Non-IID data. Specifically, we set a novel objective for the global model that is related to the local gradients. The local gradient is defined as a bi-directional vector with a step size and a direction, where the step size is the number of local updates and the direction is classified as positive or negative according to our definition. In FedVeca, the direction is influenced by the step size, so we average the bi-directional vectors to reduce the effect of differing step sizes. We then theoretically analyze the relationship between the step sizes and the global objective, and obtain upper bounds on the step sizes per communication round. Based on these upper bounds, we design an algorithm for the server and the clients to adaptively adjust the step sizes so that the objective approaches the optimum. Finally, we conduct experiments on different datasets, models, and scenarios by building a prototype system, and the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.
Vol. 35, no. 11, pp. 2102-2113.
Citations: 0
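The averaging idea above, i.e. weighting each client's gradient by its local step count so that clients that ran more local updates do not dominate the global direction, can be illustrated with a minimal sketch. The inverse-step-count weighting and the data layout are illustrative assumptions, not the paper's derived update rule.

```python
# Hypothetical sketch of step-size-aware gradient averaging in the
# spirit of the FedVeca abstract (weighting scheme is an assumption).

def vectorized_average(client_updates):
    """Average client gradients given as (gradient, step_size) pairs,
    where step_size is the number of local updates the client ran."""
    dim = len(client_updates[0][0])
    acc = [0.0] * dim
    total_weight = 0.0
    for grad, steps in client_updates:
        # Down-weight clients that accumulated more local steps.
        w = 1.0 / steps
        for i in range(dim):
            acc[i] += w * grad[i]
        total_weight += w
    return [g / total_weight for g in acc]

# Two clients: one ran 2 local steps, the other 1.
updates = [([2.0, -4.0], 2), ([1.0, -1.0], 1)]
print(vectorized_average(updates))
```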
High-Throughput GPU Implementation of Dilithium Post-Quantum Digital Signature
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-09-03, DOI: 10.1109/TPDS.2024.3453289
Shiyu Shen; Hao Yang; Wangchen Dai; Hong Zhang; Zhe Liu; Yunlei Zhao
Abstract: Digital signatures are fundamental building blocks in various protocols for providing integrity and authenticity. The development of quantum computing has raised concerns about the security guarantees afforded by classical signature schemes. CRYSTALS-Dilithium is an efficient post-quantum digital signature scheme based on lattice cryptography and has been selected as the primary algorithm for standardization by the National Institute of Standards and Technology. In this work, we present a high-throughput GPU implementation of Dilithium. For individual operations, we employ a range of computational and memory optimizations to overcome sequential constraints, reduce memory usage and IO latency, address bank conflicts, and mitigate pipeline stalls. This yields high and balanced compute and memory throughput for each operation. For concurrent task processing, we leverage task-level batching to fully exploit parallelism and implement a memory pool mechanism for rapid memory access. We propose a dynamic task scheduling mechanism to improve multiprocessor occupancy and significantly reduce execution time. Furthermore, we apply asynchronous computing and launch multiple streams to hide data transfer latencies and maximize the computing capabilities of both CPU and GPU. Across all three security levels, our GPU implementation achieves over 160× speedups for signing and over 80× speedups for verification on both commercial and server-grade GPUs. This achieves microsecond-level amortized execution times per task, offering a high-throughput, quantum-resistant solution suitable for a wide array of applications in real systems.
Vol. 35, no. 11, pp. 1964-1976.
Citations: 0
SC-CGRA: An Energy-Efficient CGRA Using Stochastic Computing
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-09-03, DOI: 10.1109/TPDS.2024.3453310
Di Mou; Bo Wang; Dajiang Liu
Abstract: Stochastic Computing (SC) offers a promising computing paradigm for low-power, cost-effective applications, with the added advantage of high error tolerance. In parallel, Coarse-Grained Reconfigurable Arrays (CGRAs) are a highly promising platform for domain-specific applications due to their combination of energy efficiency and flexibility. Intuitively, introducing SC into CGRAs would significantly reinforce the strengths of both paradigms. However, existing SC-based architectures often suffer inherent computation errors, while the stochastic number generators employed in SC incur exponentially growing latency, which is unacceptable in a CGRA. In this work, we propose an SC-based CGRA that replaces the exact multiplication in a traditional CGRA with SC-based multiplication. To improve the accuracy of SC and shorten the latency of Stochastic Number Generators (SNGs), we introduce leading-zero shifting and comparator truncation while keeping the bitstream length fixed. In addition, exploiting the flexible interconnections among PEs, we propose a quality-scaling strategy that combines neighboring PEs to achieve high-accuracy operations without switching costs such as power gating. Compared to the state-of-the-art approximate-computing CGRA design, our proposed CGRA achieves an average 65.3% reduction in output error, a 21.2% reduction in energy consumption, and a noteworthy 28.37% saving in area.
Vol. 35, no. 11, pp. 2023-2038.
Citations: 0
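The SC multiplication that the abstract builds on can be sketched in a few lines: in unipolar stochastic computing, a value in [0, 1] is encoded as a random bitstream whose ones-density equals the value, and a single AND gate multiplies two independent streams. This is the generic textbook technique only, not the paper's leading-zero-shifting or comparator-truncation design.

```python
import random

def to_bitstream(p, length, rng):
    """Unipolar SC encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(p_a, p_b, length=4096, seed=0):
    """Multiply two values in [0, 1] stochastically: the AND of two
    independent streams has ones-probability p_a * p_b, so counting
    ones approximates the product (with sampling noise)."""
    rng = random.Random(seed)
    a = to_bitstream(p_a, length, rng)
    b = to_bitstream(p_b, length, rng)
    product_stream = [x & y for x, y in zip(a, b)]
    return sum(product_stream) / length

print(sc_multiply(0.5, 0.8))  # ≈ 0.4, within sampling noise
```

Longer bitstreams shrink the noise but grow latency linearly, which is exactly the SNG-latency tension the paper addresses with a fixed bitstream length.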
Towards Efficient Graph Processing in Geo-Distributed Data Centers
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-09-03, DOI: 10.1109/TPDS.2024.3453872
Feng Yao; Qian Tao; Shengyuan Lin; Yanfeng Zhang; Wenyuan Yu; Shufeng Gong; Qiange Wang; Ge Yu; Jingren Zhou
Abstract: Iterative graph processing is widely used as a significant paradigm for large-scale data analysis. In many global businesses of multinational enterprises, graph-structured data is usually geographically distributed across regions to support low-latency services. Geo-distributed graph processing suffers from Wide Area Networks (WANs) with scarce and heterogeneous bandwidth, and thus differs essentially from traditional distributed graph processing. In this paper, we propose RAGraph, a Region-Aware framework for geo-distributed graph processing. At the core of RAGraph, we design a region-aware processing framework that advances inefficient global updates locally and enables sensible coordination-free message interactions and a flexible, replaceable communication module. For graph data preprocessing, RAGraph introduces a contribution-driven edge migration algorithm to use network resources effectively. RAGraph also contains an adaptive hierarchical message-interaction engine that switches interaction modes adaptively based on network heterogeneity and fluctuation, and a discrepancy-aware message-filtering strategy that retains important messages. Experimental results show that RAGraph achieves an average speedup of 9.7× (up to 98×) and an average WAN cost reduction of 78.5% (up to 97.3%) compared with state-of-the-art systems.
Vol. 35, no. 11, pp. 2147-2160.
Citations: 0
ComboFunc: Joint Resource Combination and Container Placement for Serverless Function Scaling With Heterogeneous Container
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-09-03, DOI: 10.1109/TPDS.2024.3454071
Zhaojie Wen; Qiong Chen; Quanfeng Deng; Yipei Niu; Zhen Song; Fangming Liu
Abstract: Serverless computing provides developers with a maintenance-free approach to resource usage, but it also transfers resource management responsibility to the cloud platform. However, the fine granularity of serverless function resources can lead to performance bottlenecks and resource fragmentation on nodes when many function containers are created. This poses challenges in effectively scaling function resources and optimizing node resource allocation, hindering overall agility. To address these challenges, we introduce ComboFunc, an innovative resource scaling system for serverless platforms. ComboFunc associates a function with heterogeneous containers of varying specifications and optimizes their resource combination and placement. This approach not only selects appropriate nodes for container creation, but also leverages the new Kubernetes feature of in-place Pod vertical scaling to enhance resource scaling agility and efficiency. By allowing a single function to correspond to heterogeneous containers with varying resource specifications, and by modifying the resource specifications of existing containers in place, ComboFunc effectively utilizes fragmented resources on nodes. This, in turn, enhances the overall resource utilization of the entire cluster and improves scaling agility. We also model the problem of combining and placing heterogeneous containers as an NP-hard problem and design a heuristic solution based on a greedy algorithm that solves it in polynomial time. We implemented a prototype of ComboFunc on the Kubernetes platform and conducted experiments using real traces on a local cluster. The results demonstrate that, compared to existing strategies, ComboFunc achieves up to 3.01× faster function resource scaling and reduces resource costs by up to 42.6%.
Vol. 35, no. 11, pp. 1989-2005.
Citations: 0
CODE+: Fast and Accurate Inference for Compact Distributed IoT Data Collection
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-09-03, DOI: 10.1109/TPDS.2024.3453607
Huali Lu; Feng Lyu; Ju Ren; Huaqing Wu; Conghao Zhou; Zhongyuan Liu; Yaoxue Zhang; Xuemin Shen
Abstract: In distributed IoT data systems, full-size data collection is impractical due to energy constraints and large system scales. Our previous work investigated the advantages of integrating matrix sampling and inference for compact distributed IoT data collection, minimizing the data collection cost while guaranteeing the data benefits. This paper further advances that technology by enabling fast and accurate inference for distributed IoT data systems that are sensitive to computation time, training stability, and inference accuracy. In particular, we propose CODE+, i.e., Compact Distributed IoT Data Collection Plus, which features a cluster-based sampling module and a Convolutional Neural Network (CNN)-Transformer autoencoder-based inference module to reduce cost and guarantee the data benefits. The sampling component employs a cluster-based matrix sampling approach, in which data clustering is first conducted and a two-step sampling is then performed according to the number of clusters and the clustering errors. The inference component integrates a CNN-Transformer autoencoder-based matrix inference model to estimate the full-size spatio-temporal data matrix; it consists of a CNN-Transformer encoder that extracts the underlying features from the sampled data matrix and a lightweight decoder that maps the learned latent features back to the original full-size data matrix. We implement CODE+ on three operational large-scale IoT systems and one synthetic Gaussian-distribution dataset, and extensive experiments demonstrate its efficiency and robustness. With a 20% sampling ratio, CODE+ achieves an average data reconstruction accuracy of 94% across the four datasets, outperforming our previous version (87%) and a state-of-the-art baseline (71%).
Vol. 35, no. 11, pp. 2006-2022.
Citations: 0
Exploring the Design Space of Distributed Parallel Sparse Matrix-Multiple Vector Multiplication
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-08-30, DOI: 10.1109/TPDS.2024.3452478
Hua Huang; Edmond Chow
Abstract: We consider the distributed-memory parallel multiplication of a sparse matrix by a dense matrix (SpMM), where the dense matrix is often a collection of dense vectors. Standard implementations multiply the sparse matrix by multiple dense vectors at the same time to exploit the computational efficiencies therein, but they generally use the same sparse matrix partitioning as when multiplying by a single vector. This article explores the design space of parallelizing SpMM and shows that a coarser-grained partitioning of the matrix, combined with a column-wise partitioning of the block of vectors, can often require less communication volume and achieve higher SpMM performance. An algorithm is presented that chooses a process grid geometry for a given number of processes to optimize parallel SpMM performance. The algorithm can augment existing graph partitioners by exploiting the additional concurrency available when multiplying by multiple dense vectors to further reduce communication.
Vol. 35, no. 11, pp. 1977-1988.
Citations: 0
Beyond Belady to Attain a Seemingly Unattainable Byte Miss Ratio for Content Delivery Networks
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-08-30, DOI: 10.1109/TPDS.2024.3452096
Peng Wang; Hong Jiang; Yu Liu; Zhelong Zhao; Ke Zhou; Zhihai Huang
Abstract: Reducing the byte miss ratio (BMR) of Content Delivery Network (CDN) caches helps providers save on traffic costs. When evicting objects or files of different sizes from CDN caches, pursuing an optimal object miss ratio (OMR) by approximating Belady no longer suffices to ensure an optimal BMR. Our experimental observations suggest that there are multiple request-sequence windows in which a replacement policy can prioritize evicting large objects and ultimately evict the object with the longest reuse distance, lowering the BMR without increasing the OMR. To capture those windows accurately, we monitor the changes in OMR and BMR using a deep reinforcement learning (RL) model and then apply a BMR-friendly replacement algorithm within these windows. Based on this policy, we propose a Belady and Size Eviction (LRU-BaSE) algorithm that reduces BMR while maintaining OMR. To make LRU-BaSE efficient and practical, we address the feedback-delay problem of RL with a two-pronged approach. On the one hand, we shorten the LRU-BaSE decision region based on the observation that the rear section of the cache queue contains most of the eviction candidates. On the other hand, the request distribution on CDNs makes it feasible to divide the learning region into multiple sub-regions, each learned with reduced time and increased accuracy. In real CDN systems, LRU-BaSE outperforms LRU by reducing "backing to OS" traffic and access latency by 30.05% and 17.07%, respectively, on average. In simulator tests, LRU-BaSE outperforms state-of-the-art cache replacement policies. On average, LRU-BaSE's BMR is 0.63% and 0.33% lower than that of Belady and Practical Flow-based Offline Optimal (PFOO), respectively. In addition, compared to Learning Relaxed Belady (LRB), LRU-BaSE yields relatively stable performance under workload drift.
Vol. 35, no. 11, pp. 1949-1963.
Citations: 0
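For reference, the Belady/MIN policy that the paper approximates and goes beyond is the classic offline rule: on a miss with a full cache, evict the resident object whose next use lies furthest in the future. A minimal sketch, counting object misses only (the trace and unit-size objects are illustrative; the paper's point is precisely that object counts alone ignore sizes and hence the BMR):

```python
def belady_misses(requests, capacity):
    """Offline Belady/MIN: evict the cached object with the
    longest forward reuse distance; count object misses."""
    cache, misses = set(), 0
    for t, obj in enumerate(requests):
        if obj in cache:
            continue  # hit
        misses += 1
        if len(cache) >= capacity:
            def next_use(o):
                # Forward reuse distance; never-reused objects
                # are the best eviction victims.
                for j in range(t + 1, len(requests)):
                    if requests[j] == o:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(obj)
    return misses

print(belady_misses(["a", "b", "c", "a", "b"], 2))  # → 4
```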
BIRD+: Design of a Lightweight Communication Compressor for Resource-Constrained Distribution Learning Platforms
IF 5.6 | CAS Zone 2 | Computer Science
IEEE Transactions on Parallel and Distributed Systems, Pub Date: 2024-08-21, DOI: 10.1109/TPDS.2024.3447221
Donglei Wu; Weihao Yang; Xiangyu Zou; Hao Feng; Dingwen Tao; Shiyi Li; Wen Xia; Binxing Fang
Abstract: The Top-K sparsification-based compression framework is extensively explored for reducing communication costs in distributed learning. However, we identified several issues with existing Top-K sparsification-based compression methods: (i) the limited compressibility of the Top-K parameters' indexes critically restricts the overall communication compression ratio; (ii) several time-consuming compression operations significantly offset the benefits of communication compression; (iii) the use of error-feedback techniques to maintain model quality results in a high memory footprint. To solve these issues, we propose BIRD, a lightweight tensor-wise Bi-Random sampling strategy with an expectation-invariance property. Specifically, BIRD applies a tensor-wise index-sharing mechanism that reduces the index proportion by allowing multiple tensor elements to share a single index, thus improving the overall compression ratio. Additionally, BIRD replaces the time-consuming Top-K sorting with a faster Bi-Random sampling strategy based on the aforementioned index-sharing mechanism, significantly reducing compression overheads. Moreover, BIRD establishes an expectation-invariance property in the Bi-Random sampling to ensure an approximately unbiased representation of the L1-norm of the sampled tensors, effectively maintaining model quality without extra memory costs. We further optimize BIRD into BIRD+ by introducing uniform-distribution-based sampling and Gamma correction in the tensor-wise sampling process, achieving more flexible adjustment of the sparsity with better convergence performance. Experimental evaluations across multiple conventional distributed learning tasks demonstrate that, compared to state-of-the-art approaches, BIRD+ achieves communication compression ratios up to 36.2× higher and computation throughput up to 149.6× higher while maintaining model quality without incurring extra memory costs.
Vol. 35, no. 11, pp. 2193-2207.
Citations: 0
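For context, the conventional Top-K sparsification baseline that BIRD+ improves on keeps only the k largest-magnitude gradient entries and transmits (index, value) pairs; the per-element index traffic and the sort are exactly the overheads that BIRD+'s index sharing and random sampling target. A minimal sketch of that baseline only, not the BIRD+ algorithm itself:

```python
import heapq

def topk_sparsify(grad, k):
    """Baseline Top-K compression: select the k largest-magnitude
    entries and return (index, value) pairs for transmission.
    Every value carries its own index, which is the index-overhead
    issue (i) the BIRD+ abstract identifies."""
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return sorted((i, grad[i]) for i in idx)

print(topk_sparsify([0.1, -2.0, 0.3, 1.5, -0.2], 2))
# → [(1, -2.0), (3, 1.5)]
```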