An Online Learning Approach for Client Selection in Federated Edge Learning under Budget Constraint
Lina Su, Ruiting Zhou, Ne Wang, Guang Fang, Zong-Qiang Li
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545062
Abstract: Federated learning (FL) has emerged as a new paradigm that enables distributed mobile devices to learn a global model collaboratively. Since mobile devices (a.k.a. clients) differ in model-training quality, client selection (CS) becomes critical for efficient FL. CS faces the following challenges. First, a client's availability, training data volume, and network connection status are time-varying and cannot be easily predicted. Second, the clients chosen for training and the number of local iterations strongly affect model accuracy, so selecting a subset of available clients and controlling local iterations must guarantee model quality. Third, renting clients for model training incurs cost, so the long-term budget must be administered dynamically without knowledge of future inputs. To this end, we propose a federated edge learning (FedL) framework that selects appropriate clients and controls the number of training iterations in real time. FedL aims to reduce completion time while reaching the desired model convergence and satisfying the long-term budget for renting clients. FedL consists of two algorithms: i) an online learning algorithm that makes CS and iteration decisions according to historical learning results; ii) an online rounding algorithm that translates the fractional decisions derived by the online learning algorithm into integers satisfying the feasibility constraints. Rigorous mathematical proof shows that the dynamic regret and dynamic fit have sub-linear upper bounds in time for a given budget. Extensive experiments on realistic datasets suggest that FedL outperforms multiple state-of-the-art algorithms; in particular, FedL reduces completion time by at least 38% compared with the others.

ElastiSim: A Batch-System Simulator for Malleable Workloads
Taylan Özden, Tim Beringer, Arya Mazaheri, H. M. Fard, F. Wolf
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545046
Abstract: As high-performance computing infrastructures move towards exascale, the role of resource and job management systems is more critical than ever. Simulating batch systems to improve scheduling algorithms and resource-management efficiency is an indispensable option, as running large-scale experiments is expensive and time-consuming. Batch-system simulators are responsible for simulating the computing infrastructure and the types of jobs that constitute the workload. In contrast to rigid jobs, malleable jobs can dynamically reconfigure their resources during runtime. Although studies indicate that malleability can improve system performance, no simulator exists to investigate malleable scheduling policies. In this work, we present ElastiSim, a batch-system simulator supporting the combined scheduling of rigid and malleable jobs. To facilitate the simulation, we propose a malleable workload model and introduce a scheduling protocol that enables the evaluation of topology-, I/O-, and progress-aware scheduling algorithms. We validate the scaling behavior of our workload model by comparing the training runtimes of various deep-learning models against the results achieved by ElastiSim. We use real-world cluster trace files to generate workloads and simulate various scheduling algorithms (FCFS, SJF, DRF, SRTF) to analyze their implications on the simulated platform. The results demonstrate that real-world executions show the same scaling behavior as our proposed workload model. We further show that ElastiSim can capture the complex interplay between emerging workloads and modern platforms, supporting algorithm designers by providing consistently meaningful results. ElastiSim is publicly available as an open-source project at https://github.com/elastisim.

Energy-efficient Edge Server Management for Edge Computing: A Game-theoretical Approach
Guangming Cui, Qiang He, Xiaoyu Xia, Feifei Chen, Yun Yang
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545079
Abstract: Similar to cloud servers, which are well-known energy consumers, edge servers running 24/7 jointly consume a tremendous amount of energy and thus require energy-saving management. However, the unique characteristics of edge computing make managing edge servers in an energy-efficient manner a new and challenging problem. First, an individual edge server usually serves a specific region, and the temporal distribution of end-users in that area impacts the edge server's energy utilization. Second, multiple base stations may cover an end-user simultaneously, and the end-user can be served by the physical machines attached to any of those base stations; serving the end-users in an area with a minimum number of physical machines minimizes the edge servers' overall energy consumption. Third, the physical machines constituting an edge server can be powered off individually when not needed to minimize the edge server's energy consumption. We formulate this Energy-efficient Edge Server Management (EESM) problem and analyze its hardness. We then propose a game-theoretical approach, EESM-G, to solve EESM problems efficiently. The superior performance of EESM-G is demonstrated on a public real-world dataset.

BSCache: A Brisk Semantic Caching Scheme for Cloud-based Performance Monitoring Timeseries Systems
Kai Zhang, Zhiqi Wang, Z. Shao
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3546183
Abstract: Cloud-based performance monitoring timeseries systems are emerging thanks to the cloud's flexibility and pay-as-you-go capabilities. For such systems, caching is particularly important given the limited bandwidth and long access latency of cloud storage. However, existing cache schemes, such as those built on external cache systems (e.g., Memcached or Redis), are not specifically designed for timeseries data and thus provide suboptimal performance. In this paper, we propose BSCache, a novel lightweight variant of semantic caching designed specifically for timeseries workloads in cloud-based performance monitoring systems. BSCache supports semantic-aware, metadata-data mixed in-memory management, which significantly improves timeseries query performance. We have implemented a fully functional, open-source prototype of BSCache and integrated it into Cortex, a distributed performance monitoring timeseries system widely adopted in industry, and compared it with Memcached, Cortex's default caching system. Experimental results show that, at the same cache sizes, BSCache significantly improves query performance with higher cache hit ratios and lower CPU overhead than Memcached in Cortex.

LDPP: A Learned Directory Placement Policy in Distributed File Systems
Yuanzhang Wang, Fengkui Yang, Ji Zhang, Chun-hua Li, Ke Zhou, Chong Liu, Zhuo Cheng, Wei Fang, Jinhu Liu
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545057
Abstract: Load balancing is a critical problem in distributed file systems (DFS). Previous works focus on distributing data evenly across nodes or storage devices at the file level, but neglect to exploit directories' locality and long-lived hotness, which can degrade both balance and performance. To overcome this shortcoming, we propose a learning-based directory placement policy, LDPP, which determines the data layout by predicting load. We first establish a relationship between directory request characteristics and state information to predict a directory's state (storage capacity, bandwidth, and IOPS). A new directory is then placed on a node chosen by Manhattan distance over the predicted multidimensional state. In addition, we consider the trade-off between directories of the same category, as classified by the load-prediction module, and peer directories, and explore their influence on balance. Extensive experiments demonstrate that LDPP not only efficiently alleviates load imbalance and increases resource utilization but also improves DFS performance in practice, reducing service latency by up to 36% and increasing IOPS and bandwidth by 8% and 9%, respectively.

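The Manhattan-distance placement idea behind LDPP can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's algorithm: the node vectors, units, and the `place_directory` helper are all hypothetical.

```python
# Hypothetical sketch: place a new directory on the node whose free-resource
# vector (capacity, bandwidth, IOPS) is closest, in Manhattan (L1) distance,
# to the directory's predicted demand. Names and numbers are illustrative.

def manhattan(a, b):
    """L1 distance between two equal-length state vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def place_directory(predicted_load, nodes):
    """Pick the node whose state vector best matches the predicted demand."""
    return min(nodes, key=lambda name: manhattan(predicted_load, nodes[name]))

nodes = {
    "node-a": (100, 50, 2000),  # free capacity (GB), bandwidth (MB/s), IOPS
    "node-b": (20, 10, 500),
    "node-c": (60, 40, 1500),
}
print(place_directory((25, 12, 600), nodes))  # → node-b, the closest match
```

A real policy would normalize each dimension before comparing, since raw capacity, bandwidth, and IOPS live on very different scales.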
SMEGA2: Distributed Asynchronous Deep Neural Network Training With a Single Momentum Buffer
Refael Cohen, Ido Hakimi, A. Schuster
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545010
Abstract: As the field of deep learning progresses and neural networks become larger, training them has become a demanding and time-consuming task. To tackle this problem, distributed deep learning must be used to scale the training of deep neural networks to many workers. Synchronous algorithms, commonly used for distributing the training, are susceptible to faulty or straggling workers. Asynchronous algorithms do not suffer from the problems of synchronization, but introduce a new problem known as staleness. Staleness is caused by applying out-of-date gradients, and it can greatly hinder the convergence process. Furthermore, asynchronous algorithms that incorporate momentum often require keeping a separate momentum buffer for each worker, which costs additional memory proportional to the number of workers. We introduce a new asynchronous method, SMEGA2, which requires a single momentum buffer regardless of the number of workers. Our method lets us estimate the future position of the parameters, thereby minimizing the staleness effect. We evaluate our method on the CIFAR and ImageNet datasets and show that SMEGA2 outperforms existing methods in terms of final test accuracy while scaling up to as many as 64 asynchronous workers. Open-source code: https://github.com/rafi-cohen/SMEGA2

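The single-buffer idea can be sketched in a few lines. This is a simplified illustration in the spirit of the abstract, not the authors' exact update rule (see their repository for that); the class and its look-ahead formula are assumptions.

```python
# Illustrative single-momentum-buffer asynchronous parameter server.
# One velocity vector is shared by all workers, so memory does not grow
# with the worker count; a look-ahead estimate of the parameters is used
# to reduce the staleness of gradients computed from them.
import numpy as np

class SingleMomentumServer:
    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = params.astype(float)
        self.velocity = np.zeros_like(self.params)  # one buffer, any # of workers
        self.lr, self.momentum = lr, momentum

    def apply_gradient(self, grad):
        # classic momentum step, shared across all asynchronous workers
        self.velocity = self.momentum * self.velocity + grad
        self.params -= self.lr * self.velocity

    def lookahead_params(self):
        # estimate where the parameters are heading if the current velocity
        # persists; workers compute gradients at this estimated position
        return self.params - self.lr * self.momentum * self.velocity
```

With momentum 0.9 and learning rate 0.1, a unit gradient on zero-initialized parameters moves them to -0.1, while the look-ahead already reports -0.19, anticipating the next momentum step.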
Towards Fast Large-scale Graph Analysis via Two-dimensional Balanced Partitioning
Shuai Lin, Rui Wang, Yongkun Li, Yinlong Xu, John C.S. Lui, Fei Chen, Pengcheng Wang, Lei Han
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545060
Abstract: Distributed graph systems leverage a cluster of machines by partitioning a large graph into multiple small subgraphs, so graph partitioning usually has a significant impact on their performance. However, the partition schemes widely used in practical graph systems achieve balance in only one dimension, e.g., either the number of vertices or the number of edges, and they may also incur many edge cuts. To address this problem, we develop BPart, which adopts a two-phase partition scheme to achieve two-dimensional balance over both vertices and edges. Its core idea is to first partition the original graph into many small pieces, more than the number of machines in the cluster, and then selectively combine these small pieces into larger subgraphs so that the resulting partition is balanced in both dimensions. We implement BPart in two open-source distributed graph systems, Gemini [58] and KnightKing [57]. Results show that BPart achieves good balance in both dimensions and also significantly reduces the number of edge cuts. As a result, BPart reduces the total running time of various graph applications by 5%-70% compared to multiple existing partition schemes, e.g., Chunk-V, Chunk-E, Fennel, and Hash.

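The two-phase combine step can be sketched as a greedy bin-packing over small pieces. This is an illustrative simplification under assumed inputs (per-piece vertex/edge counts), not BPart's actual algorithm.

```python
# Sketch of a two-phase, two-dimensionally balanced combine step:
# phase 1 (not shown) splits the graph into many small pieces; phase 2
# greedily assigns each piece to the machine with the lightest combined
# vertex + edge load, largest pieces first.

def two_phase_combine(pieces, num_machines):
    """pieces: list of (num_vertices, num_edges) per small piece.
    Returns (assignment, loads): piece indices per machine and the
    final [vertices, edges] load of each machine."""
    loads = [[0, 0] for _ in range(num_machines)]
    assignment = [[] for _ in range(num_machines)]
    # largest pieces first, a standard greedy balancing heuristic
    order = sorted(range(len(pieces)),
                   key=lambda i: -(pieces[i][0] + pieces[i][1]))
    for i in order:
        v, e = pieces[i]
        # machine with the smallest combined vertex+edge load so far
        m = min(range(num_machines), key=lambda m: sum(loads[m]))
        loads[m][0] += v
        loads[m][1] += e
        assignment[m].append(i)
    return assignment, loads
```

For example, combining four pieces of sizes (4,4), (4,4), (2,2), (2,2) onto two machines yields identical loads of 6 vertices and 6 edges each. A real scheme would also weigh edge cuts between pieces, not just load.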
Boosting Cross-rack Multi-stripe Repair in Heterogeneous Erasure-coded Clusters
H. Zhou, D. Feng
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545029
Abstract: Large-scale distributed storage systems have introduced erasure codes to guarantee high data reliability, yet inevitably at the expense of high repair costs. In practice, storage nodes are usually divided into racks, and the data blocks on storage nodes are organized into multiple stripes, each independently manipulated by the erasure code. Due to the scarcity and heterogeneity of cross-rack bandwidth, cross-rack network transmission dominates the repair cost. We argue that when erasure codes are deployed in a rack architecture, existing repair techniques are limited in several respects: they neglect the heterogeneous cross-rack bandwidth, give little consideration to multi-stripe failures, apply no special treatment to repair-link scheduling, and target only specific erasure code constructions. In this paper, we present CMRepair, an efficient Cross-rack Multi-stripe Repair technique that reduces the repair time of multi-stripe failures in heterogeneous erasure-coded clusters. CMRepair carefully chooses the nodes for reading and repairing blocks and greedily searches for a near-optimal multi-stripe repair solution that reduces cross-rack repair time while introducing only negligible computational overhead. Furthermore, it selectively schedules the execution order of cross-rack links, with the primary objective of saturating unused upload/download bandwidth and avoiding network congestion. CMRepair can also be extended to handle full-node repair and multi-failure repair, and adapts to different erasure codes. Experiments show that CMRepair reduces cross-rack repair time by 6.42%-62.50% and improves repair throughput by 24.94%-53.91%.

Regularizing Sparse and Imbalanced Communications for Voxel-based Brain Simulations on Supercomputers
Yuhao Liu, Xin Du, Zhihui Lu, Qiang Duan, Jianfeng Feng, Ming-zhi Wang, Jie Wu
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545019
Abstract: Inter-process communications form a performance bottleneck for large-scale brain simulations. The sparse and imbalanced communication patterns of the human brain make it particularly challenging to design a communication system supporting large-scale brain simulations. In this paper, we tackle the communication challenges posed by large-scale brain simulations with sparse and imbalanced communication patterns. We design a virtual communication topology with a merge-and-forward algorithm that exploits the sparsity to regularize inter-process communications. To balance the communication loads of different processes, we formulate voxel partitioning in brain simulations as a k-way graph partition problem and propose a constrained deterministic greedy algorithm to solve it effectively. Extensive simulation experiments evaluating the proposed scheme show that it may significantly reduce communication overheads and shorten simulation time for large-scale brain models.

Scheduling Fork-Join Task Graphs with Communication Delays and Equal Processing Times
Huijun Wang, O. Sinnen
Proceedings of the 51st International Conference on Parallel Processing, 2022-08-29. DOI: https://doi.org/10.1145/3545008.3545036
Abstract: Task scheduling for parallel computing is strongly NP-hard even without precedence constraints (P||C_max). With any kind of precedence constraints and communication delays, the problem becomes less manageable still. We look at the specific case of scheduling under the precedence constraints of a fork-join structure, including communication delays (P|fork-join, c_ij|C_max). This represents any computation that divides into sub-computations whose end results are processed together, such as divide and conquer; this kind of computation is fundamental. We examine the instances where some of the computation and communication costs are constant, present polynomial-time algorithms for them, and explore the boundary between tractability and NP-hardness around this problem.