{"title":"Cost-Effective and Low-Latency Data Placement in Edge Environment Based on PageRank-Inspired Regional Value","authors":"Pengwei Wang;Junye Qiao;Yuying Zhao;Zhijun Ding","doi":"10.1109/TPDS.2024.3506625","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3506625","url":null,"abstract":"Edge storage offers low-latency services to users. However, due to strained edge resources and high costs, enterprises must choose the data that most warrant placement at the edge and place it in the right location. In practice, data exhibit temporal and spatial properties, and variability, which have a significant impact on their placement, but have been largely ignored in research. To address this, we introduce the concept of data temperature, which considers data characteristics over time and space. To consider the influence of spatial relevance among different regions for placing data, inspired by PageRank, we present a model using data temperature to assess the regional value of data, which effectively leverages collaboration within the edge storage system. We also propose a regional value-based algorithm (RVA) that minimizes cost while meeting user response time requirements. By taking into account the correlation between regions, the RVA can achieve lower latency than current methods when creating an equal or even smaller number of replicas. 
Experimental results validate the efficacy of the proposed method in terms of latency, success rate, and cost efficiency.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"185-196"},"PeriodicalIF":5.6,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
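The RVA abstract builds on PageRank to propagate data temperature across spatially related regions. As a rough illustration of that general idea only, the sketch below runs a personalized PageRank power iteration in which a hypothetical per-region data temperature serves as the teleport distribution and a hypothetical `links` list encodes which regions are related. It is not the paper's regional-value model; the damping factor of 0.85 is simply the conventional PageRank default.

```python
def regional_value(links, temperature, damping=0.85, tol=1e-12, max_iter=1000):
    """Personalized PageRank over regions (illustrative sketch, not the paper's model).

    links[i]: hypothetical list of regions that region i is spatially related to.
    temperature[i]: hypothetical data temperature of region i, used as the
    teleport (personalization) distribution.
    """
    n = len(links)
    total = float(sum(temperature))
    teleport = [t / total for t in temperature]  # normalize so it sums to 1
    value = teleport[:]
    for _ in range(max_iter):
        nxt = [(1.0 - damping) * p for p in teleport]
        for i, outs in enumerate(links):
            if outs:
                # spread region i's value evenly over its related regions
                share = damping * value[i] / len(outs)
                for j in outs:
                    nxt[j] += share
            else:
                # region with no links: hand its value back via the teleport vector
                for j in range(n):
                    nxt[j] += damping * value[i] * teleport[j]
        converged = sum(abs(a - b) for a, b in zip(nxt, value)) < tol
        value = nxt
        if converged:
            break
    return value


# Three regions in a chain; region 0 holds the hottest data.
values = regional_value([[1], [0, 2], [1]], [10.0, 1.0, 1.0])
print([round(v, 3) for v in values])
```

Because the teleport vector is weighted by temperature, a hot region keeps more value than a cold region with the same link structure, while linked neighbors still share some of it.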
{"title":"DegaFL: Decentralized Gradient Aggregation for Cross-Silo Federated Learning","authors":"Jialiang Han;Yudong Han;Xiang Jing;Gang Huang;Yun Ma","doi":"10.1109/TPDS.2024.3501581","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3501581","url":null,"abstract":"Federated learning (FL) is an emerging and promising paradigm of privacy-preserving machine learning (ML). An important type of FL is cross-silo FL, which enables a moderate number of organizations to cooperatively train a shared model by keeping confidential data local and aggregating gradients on a central parameter server. However, in practice the central server may be vulnerable to malicious attacks or software failures. To address this issue, we propose DegaFL, a novel decentralized gradient aggregation approach for cross-silo FL. DegaFL eliminates the central server by aggregating gradients on each participant, and it maintains and synchronizes the gradients of only the current training round. In addition, we propose AdaAgg to adaptively aggregate correct gradients from honest nodes, and we use HotStuff to keep the training round number and gradients consistent across all nodes. 
Experimental results show that DegaFL defends against common threat models with minimal accuracy loss and, compared to state-of-the-art decentralized FL approaches, achieves up to a 50× reduction in storage overhead and up to a 13× reduction in network overhead.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"212-225"},"PeriodicalIF":5.6,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
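The DegaFL abstract names AdaAgg as its aggregation rule but does not describe it in enough detail to reproduce. To illustrate the general idea of each participant aggregating peers' gradients while tolerating a faulty one, the sketch below swaps in a different, well-known robust rule, the coordinate-wise median. This is an assumption for illustration, not AdaAgg, and it leaves out HotStuff-based agreement entirely.

```python
from statistics import median


def coordinate_median(gradients):
    """Aggregate gradient vectors by taking the median of each coordinate.

    A swapped-in robust rule for illustration; it is NOT DegaFL's AdaAgg.
    """
    dim = len(gradients[0])
    if any(len(g) != dim for g in gradients):
        raise ValueError("all gradients must have the same length")
    return [median([g[k] for g in gradients]) for k in range(dim)]


honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
byzantine = [[100.0, -100.0]]  # one malicious participant
print(coordinate_median(honest + byzantine))  # stays close to the honest values
```

A plain mean of the same four vectors would be dragged far off by the single malicious vector, whereas the median moves only slightly.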
{"title":"Two-Dimensional Balanced Partitioning and Efficient Caching for Distributed Graph Analysis","authors":"Shuai Lin;Rui Wang;Yongkun Li;Yinlong Xu;John C. S. Lui","doi":"10.1109/TPDS.2024.3501292","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3501292","url":null,"abstract":"Distributed graph analysis usually partitions a large graph into multiple smaller subgraphs and distributes them across a cluster of machines for computation. Graph partitioning therefore plays a crucial role in distributed graph analysis. However, widely used existing graph partitioning schemes either balance only one dimension (the number of edges or vertices) or incur a large number of edge cuts, which degrades the performance of distributed graph analysis. In this article, we propose a novel graph partitioning scheme, BPart, and two enhanced algorithms, BPart-C and BPart-S, that balance the partition in both vertices and edges while also reducing the number of edge cuts. In addition, we propose a neighbor-aware caching scheme that further reduces the number of edge cuts to improve the efficiency of distributed graph analysis. Our experimental results show that, compared to multiple existing graph partitioning algorithms (Chunk-V, Chunk-E, Fennel, and Hash), BPart-C and BPart-S achieve a better balance in both dimensions (the number of vertices and edges) while reducing the number of edge cuts. We also integrate these partitioning algorithms into two popular distributed graph systems, KnightKing and Gemini, to validate their impact on graph analysis efficiency. Results show that BPart-C and BPart-S can significantly reduce the total running time of various graph applications, by up to 60% and 70%, respectively. 
In addition, the neighbor-aware caching scheme can further improve performance by up to 24%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"133-149"},"PeriodicalIF":5.6,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spreeze: High-Throughput Parallel Reinforcement Learning Framework","authors":"Jing Hou;Guang Chen;Ruiqi Zhang;Zhijun Li;Shangding Gu;Changjun Jiang","doi":"10.1109/TPDS.2024.3497986","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3497986","url":null,"abstract":"Promoting large-scale applications of reinforcement learning (RL) requires efficient training computation. While existing parallel RL frameworks encompass a variety of RL algorithms and parallelization techniques, their excessively burdensome communication frameworks keep them from reaching the hardware's limits on final throughput and training results on a single desktop. In this article, we propose Spreeze, a lightweight parallel framework for RL that efficiently utilizes the hardware resources of a single desktop to approach its throughput limit. We asynchronously parallelize experience sampling, network updates, performance evaluation, and visualization, and employ multiple efficient data transmission techniques to transfer various types of data between processes. The framework can automatically adjust its parallelization hyperparameters based on the computing capability of the hardware in order to perform efficient large-batch updates. Exploiting the characteristics of the “Actor-Critic” RL algorithm, our framework uses dual GPUs to update the actor and critic networks independently, further improving throughput. Simulation results show that, using only a personal desktop computer, our framework achieves up to 15,000 Hz experience sampling and a 370,000 Hz network update frame rate, an order of magnitude higher than other mainstream parallel RL frameworks, resulting in a 73% reduction in training time. 
Our work on fully utilizing the hardware resources of a single desktop computer is fundamental to enabling efficient large-scale distributed RL training.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"282-292"},"PeriodicalIF":5.6,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142890380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Service Demand Variability on Data Center Performance","authors":"Diletta Olliaro;Adityo Anggraito;Marco Ajmone Marsan;Simonetta Balsamo;Andrea Marin","doi":"10.1109/TPDS.2024.3497792","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3497792","url":null,"abstract":"Modern data centers feature an extensive array of cores that handle a highly diverse range of jobs. Recent traces shared by leading cloud data center enterprises such as Google and Alibaba reveal that the constant increase in data center services and computational power is accompanied by growing variability in service demands. The number of cores needed by a job can vary widely, from one to several thousand, and the number of seconds a core is held by a job can span more than five orders of magnitude. In this context of extreme variability, the policies governing the allocation of cores to jobs play a crucial role in data center performance. It is widely acknowledged that the First-In First-Out (FIFO) policy tends to underutilize available computing capacity because of the varying magnitudes of core requests. However, as we will show, the impact of extreme service demand variability on job waiting and response times, which has been studied extensively in traditional queueing models, is not as well understood for data centers. To address this issue, we investigate the dynamics of a data center cluster through analytical models in simple cases and through discrete-event simulations based on real data. Our findings emphasize the significant impact of service demand variability, in terms of both requested cores and service times, and provide insights for enhancing data center performance. 
In particular, we show how data center performance can be improved by controlling the interplay between service and waiting times through the assignment of cores to jobs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 2","pages":"120-132"},"PeriodicalIF":5.6,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10753043","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
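To see why FIFO core allocation can leave capacity idle when core requests vary in size, the sketch below computes job start times on a cluster with a fixed number of cores under strict FIFO: a job may start only after every earlier job has started and enough cores are free. The job list is invented for illustration, and the model omits everything else the paper analyzes (arrival processes, real traces, alternative policies).

```python
def fifo_start_times(jobs, total_cores):
    """Start times under strict FIFO core allocation (no overtaking).

    jobs: list of (arrival, cores, service) tuples sorted by arrival time.
    A job starts only after all earlier jobs have started AND enough cores
    are free, so a large job at the head of the queue blocks smaller ones.
    """
    running = []  # (end_time, cores) for every job already started
    starts = []
    prev_start = float("-inf")
    for arrival, cores, service in jobs:
        if cores > total_cores:
            raise ValueError("job requests more cores than the cluster has")
        # FIFO: never start before the job ahead in the queue
        t = max(arrival, prev_start)
        while total_cores - sum(c for end, c in running if end > t) < cores:
            # not enough free cores: jump to the next completion after t
            t = min(end for end, _ in running if end > t)
        starts.append(t)
        running.append((t + service, cores))
        prev_start = t
    return starts


# 4 cores; a 4-core job at the head of the queue blocks a 1-core job.
jobs = [(0, 3, 10), (1, 4, 1), (2, 1, 1)]
print(fifo_start_times(jobs, total_cores=4))  # [0, 10, 11]
```

The third job waits 9 time units even though one core sits idle from time 1 to time 10, because the 4-core job ahead of it cannot start yet.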
{"title":"Retrospecting Available CPU Resources: SMT-Aware Scheduling to Prevent SLA Violations in Data Centers","authors":"Haoyu Liao;Tong-yu Liu;Jianmei Guo;Bo Huang;Dingyu Yang;Jonathan Ding","doi":"10.1109/TPDS.2024.3494879","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3494879","url":null,"abstract":"This article focuses on an understudied yet fundamental problem: existing methods typically average the utilization of multiple hardware threads to evaluate the available CPU resources. However, on Simultaneous Multi-Threading (SMT) processors this approach can underestimate the actual usage of the underlying physical core, leading to an overestimation of the remaining resources. The overestimation propagates from the microarchitecture to operating systems and cloud schedulers, where it may misguide scheduling decisions, exacerbate CPU overcommitment, and increase Service Level Agreement (SLA) violations. To address this overestimation problem, we propose an SMT-aware, purely data-driven approach named Remaining CPU (RCPU) that reserves more CPU resources to restrict CPU overcommitment and prevent SLA violations. RCPU requires only a few modifications to existing cloud infrastructures and can scale to large data centers. Extensive evaluations in the data center show that RCPU reduces SLA violations by 18% on average for 98% of all latency-sensitive applications. 
In a benchmarking experiment, we show that RCPU improves accuracy by 69% in terms of Mean Absolute Error (MAE) compared to the state of the art.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 1","pages":"67-83"},"PeriodicalIF":5.6,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
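The RCPU abstract argues that averaging sibling hardware-thread utilization underestimates physical-core usage. The sketch below illustrates that gap under one stated assumption: the physical core counts as "used" whenever at least one of its hardware threads is busy. Under that assumption, the core's busy fraction is at least the largest thread utilization, so the average can never exceed it. This illustrates the problem only; it is not RCPU, which the abstract describes as data-driven.

```python
from math import prod


def averaged_utilization(thread_utils):
    """What an averaging approach reports for the physical core."""
    return sum(thread_utils) / len(thread_utils)


def core_busy_bounds(thread_utils):
    """Bounds on the fraction of time the core has at least one busy thread.

    Assumption: the core is 'used' whenever any sibling thread is busy.
    """
    lower = max(thread_utils)              # threads always busy at the same time
    upper = min(1.0, sum(thread_utils))    # threads never busy at the same time
    return lower, upper


def core_busy_if_independent(thread_utils):
    """Busy fraction if sibling threads are busy independently (an extra assumption)."""
    return 1.0 - prod(1.0 - u for u in thread_utils)


utils = [0.5, 0.5]  # two sibling hardware threads, each 50% busy
print(averaged_utilization(utils))      # 0.5
print(core_busy_bounds(utils))          # (0.5, 1.0)
print(core_busy_if_independent(utils))  # 0.75
```

Averaging reports 50% of the core as free, while the true free fraction under this model can be anywhere from 0% to 50%, which is the kind of overestimation that can lead to overcommitment.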
{"title":"Balanced Splitting: A Framework for Achieving Zero-Wait in the Multiserver-Job Model","authors":"Jonatha Anselmi;Josu Doncel","doi":"10.1109/TPDS.2024.3493631","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3493631","url":null,"abstract":"We present a new framework for designing nonpreemptive, job-size-oblivious scheduling policies in the multiserver-job queueing model. The main requirement is to identify a static and balanced sub-partition of the server set and to ensure that the servers in each set of that sub-partition handle only jobs of a given class, in first-come first-served order. A job class is determined by the number of servers to which it has exclusive access during its entire execution and by the probability distribution of its service time. This approach aims to reduce delays by preventing small jobs from being blocked by larger ones that arrived first, and it is particularly beneficial when job-size variability is small within classes and large across classes. In this setting, we propose a new scheduling policy, Balanced-Splitting. In our main results, we provide a sufficient condition for the stability of Balanced-Splitting and show that the resulting queueing probability, i.e., the probability that an arriving job must wait before being processed, vanishes in both the subcritical (the load is fixed at a constant less than one) and critical (the load approaches one from below) many-server limiting regimes. Crucial to our analysis is a connection with the M/GI/s/s queue and Erlang’s loss formula, which allows our analysis to rely on fundamental results from queueing theory. Numerical simulations show that the proposed policy outperforms several preemptive/nonpreemptive, size-aware/size-oblivious policies in various practical scenarios. 
This is also confirmed by simulations on real traces from High Performance Computing (HPC) workloads. The delays induced by Balanced-Splitting are also competitive with those of state-of-the-art policies such as First-Fit-SRPT and ServerFilling-SRPT, even though our approach requires neither preemption nor knowledge of job sizes.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 1","pages":"43-54"},"PeriodicalIF":5.6,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142672046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
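The Balanced-Splitting analysis relies on the M/GI/s/s queue and Erlang's loss formula. For reference, the sketch below computes the standard Erlang B blocking probability with the usual numerically stable recursion B(0) = 1, B(n) = a·B(n−1) / (n + a·B(n−1)), where a is the offered load in Erlangs. It shows only the classical formula; how the paper maps server sub-partitions and job classes onto it is not reproduced here.

```python
def erlang_b(servers, offered_load):
    """Blocking probability of an M/G/s/s (Erlang loss) system.

    servers: number of servers s (>= 0); offered_load: a = arrival rate * mean
    service time, in Erlangs. By the insensitivity property, the result depends
    on the service-time distribution only through its mean.
    """
    if servers < 0 or offered_load < 0:
        raise ValueError("servers and offered_load must be non-negative")
    b = 1.0  # B(0): with no servers, every arrival is blocked
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b


for s in range(1, 6):
    print(s, round(erlang_b(s, 3.0), 4))
```

The recursion avoids computing large factorials and powers directly, which keeps it stable for hundreds of servers, the many-server regime the paper's limits concern.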
{"title":"Edge Data Deduplication Under Uncertainties: A Robust Optimization Approach","authors":"Ruikun Luo;Qiang He;Mengxi Xu;Feifei Chen;Song Wu;Jing Yang;Yuan Gao;Hai Jin","doi":"10.1109/TPDS.2024.3493959","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3493959","url":null,"abstract":"The emergence of mobile edge computing (MEC) in distributed systems has drawn increased attention to edge data management. A conflict arises between limited edge resources and continuously growing demand for data storage, making the reduction of data storage costs a critical objective. Despite extensive studies of edge data deduplication as a data reduction technique, existing deduplication methods face numerous challenges in MEC environments. These challenges stem from the differences between edge servers and cloud data center servers, as well as from uncertainties such as user mobility, which make deduplication decisions insufficiently robust. This paper therefore presents a robust optimization-based approach to the edge data deduplication problem. Accounting for uncertainties including the number of data requirements and edge server failures, we propose two solution algorithms: uEDDE-C, a two-stage algorithm based on column-and-constraint generation, and uEDDE-A, an approximation algorithm that addresses the high computational overhead of uEDDE-C. Our method enables efficient data deduplication in volatile edge network environments and remains robust across various uncertain scenarios. We validate the performance and robustness of uEDDE-C and uEDDE-A through theoretical analysis and experimental evaluations. 
Extensive experimental results demonstrate that our approach significantly reduces data storage cost and data retrieval latency while ensuring reliability in real-world MEC environments.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 1","pages":"84-95"},"PeriodicalIF":5.6,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10747105","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142736348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
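The robust-optimization idea in this abstract, choosing a deduplication decision that stays good under the worst realization of the uncertainty, can be illustrated at toy scale with a min-max over a finite scenario set. The sketch below enumerates decisions directly; it is not uEDDE-C's two-stage column-and-constraint generation or uEDDE-A's approximation. The cost model (one storage unit per copy, a penalty of 100 when failures can remove every copy) is entirely hypothetical.

```python
def robust_choice(decisions, scenarios, cost):
    """Min-max robust decision: minimize the worst-case cost over a finite scenario set."""
    return min(decisions, key=lambda d: max(cost(d, s) for s in scenarios))


def toy_cost(copies, failed_servers):
    """Hypothetical cost of keeping `copies` replicas of a data item after deduplication.

    Worst-case assumption: failed servers are exactly the ones holding copies,
    so if failures >= copies the item becomes unavailable and a penalty applies.
    """
    penalty = 100.0 if failed_servers >= copies else 0.0
    return 1.0 * copies + penalty


nominal = robust_choice([1, 2, 3], [0], toy_cost)     # plans as if nothing fails
robust = robust_choice([1, 2, 3], [0, 1], toy_cost)   # guards against one failure
print(nominal, robust)  # 1 2
```

The nominal plan deduplicates down to a single copy, while the robust plan keeps one extra copy because a single server failure would otherwise make the data unavailable.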
{"title":"Ripple: Enabling Decentralized Data Deduplication at the Edge","authors":"Ruikun Luo;Qiang He;Feifei Chen;Song Wu;Hai Jin;Yun Yang","doi":"10.1109/TPDS.2024.3493953","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3493953","url":null,"abstract":"With its advantages in ensuring low data retrieval latency and reducing backhaul network traffic, edge computing is becoming a backbone solution for many latency-sensitive applications. An increasingly large amount of data is being generated at the edge, straining the limited capacity of edge storage systems. Improving resource utilization for edge storage systems has become a significant challenge in recent years. Existing solutions attempt to achieve this goal through data placement optimization, data partitioning, data sharing, etc. These approaches overlook data redundancy in edge storage systems, which wastes substantial storage resources. This motivates an approach to data deduplication at the edge. However, existing data deduplication methods rely on centralized control, which is not always feasible in practical edge computing environments. This article presents Ripple, the first approach that enables edge servers to deduplicate their data in a decentralized manner. At its core, Ripple builds a data index for each edge server, enabling the servers to deduplicate data without central control. With Ripple, edge servers can 1) identify data duplicates; 2) remove redundant data without violating data retrieval latency constraints; and 3) ensure data availability after deduplication. The results of trace-driven experiments conducted in a testbed system demonstrate the usefulness of Ripple in practice. 
Compared with the state-of-the-art approach, Ripple improves the deduplication ratio by up to 16.79% and reduces data retrieval latency by 60.42% on average.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 1","pages":"55-66"},"PeriodicalIF":5.6,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10747114","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142672007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
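Ripple identifies duplicates through a per-server data index rather than a central controller. The abstract does not specify the index format, so the sketch below assumes a simple one: each server maps the SHA-256 fingerprint of every stored item to the item's name, and a server detects duplicates by checking its own fingerprints against the indexes its neighbors advertise. Only fingerprints, not the data itself, need to be exchanged. Deciding which copy can actually be removed under latency and availability constraints, which Ripple also handles, is not modeled here.

```python
import hashlib


def fingerprint_index(items):
    """Map the SHA-256 fingerprint of each stored data item to its name (assumed index format)."""
    return {hashlib.sha256(blob).hexdigest(): name for name, blob in items.items()}


def duplicates_with_neighbors(local_index, neighbor_indexes):
    """For each local item, list the neighboring servers that store identical data.

    Runs on one server using only the fingerprint indexes received from its
    neighbors, so no central coordinator is involved.
    """
    shared = {}
    for fp, name in local_index.items():
        holders = [server for server, index in neighbor_indexes.items() if fp in index]
        if holders:
            shared[name] = holders
    return shared


local = fingerprint_index({"a": b"video-1", "b": b"map-7"})
neighbors = {
    "s2": fingerprint_index({"c": b"video-1"}),
    "s3": fingerprint_index({"d": b"log-3", "e": b"map-7"}),
}
print(duplicates_with_neighbors(local, neighbors))  # {'a': ['s2'], 'b': ['s3']}
```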
{"title":"EdgeHydra: Fault-Tolerant Edge Data Distribution Based on Erasure Coding","authors":"Qiang He;Guobiao Zhang;Jiawei Wang;Ruikun Luo;Xiaohai Dai;Yuchong Hu;Feifei Chen;Hai Jin;Yun Yang","doi":"10.1109/TPDS.2024.3493034","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3493034","url":null,"abstract":"In edge computing environments, app vendors can distribute popular data from the cloud to edge servers to provide low-latency data retrieval. A key problem is how to distribute these data from the cloud to edge servers cost-effectively. Under current schemes, a file is divided into data blocks for parallel transmission from the cloud to target edge servers, which then combine the received blocks to reconstruct the file. While this method expedites data distribution, it is sensitive to transmission delays and failures caused by runtime exceptions such as network fluctuations and server failures. This paper presents EdgeHydra, the first edge data distribution scheme that tackles this challenge through fault tolerance based on erasure coding. Under EdgeHydra, a file is encoded into data blocks and parity blocks for parallel transmission from the cloud to target edge servers. An edge server can reconstruct the file once it has received a sufficient number of these blocks, without waiting for all blocks in transmission. EdgeHydra also employs a leaderless block supplement mechanism to ensure that each target edge server receives sufficient blocks. Together, these mechanisms significantly improve the robustness of the data distribution process. 
Extensive experiments show that EdgeHydra effectively tolerates delays and failures in individual transmission links, outperforming the state-of-the-art scheme by up to 50.54% in distribution time.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 1","pages":"29-42"},"PeriodicalIF":5.6,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10746622","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142672047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
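EdgeHydra's key property is that a server can rebuild a file from any sufficient subset of the data and parity blocks, without waiting for every transmission. The abstract does not say which erasure code it uses, so the sketch below uses the simplest one for illustration: k data blocks plus a single XOR parity block, which tolerates the loss or delay of any one block. Production schemes typically use codes with more parity blocks (for example, Reed-Solomon); this is not EdgeHydra's encoder.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int):
    """Split data into k equal-size data blocks (zero-padded) plus one XOR parity block."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, blocks)
    return blocks + [parity]  # index k is the parity block


def decode(received: dict, k: int, length: int) -> bytes:
    """Rebuild the data from any k of the k+1 blocks.

    received maps block index -> block bytes; index k is the parity block.
    """
    if len(received) < k:
        raise ValueError("need at least k blocks to reconstruct")
    missing = [i for i in range(k) if i not in received]
    if len(missing) > 1:
        raise ValueError("a single parity block can recover at most one lost data block")
    if missing:
        m = missing[0]
        received = dict(received)  # do not mutate the caller's dict
        # XOR of all other blocks (including parity) equals the missing block
        received[m] = reduce(xor_bytes, (received[i] for i in range(k + 1) if i != m))
    return b"".join(received[i] for i in range(k))[:length]


data = b"edge data distribution"
blocks = encode(data, k=3)
# Block 1 is delayed or lost in transit; the other three arrive.
arrived = {0: blocks[0], 2: blocks[2], 3: blocks[3]}
print(decode(arrived, k=3, length=len(data)))
```

The receiver stops waiting as soon as any three of the four blocks arrive, which is the property that makes distribution tolerant of a single slow or failed link.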