FAIR-BFL: Flexible and Incentive Redesign for Blockchain-based Federated Learning
Rongxin Xu, Shiva Raj Pokhrel, Qiujun Lan, Gang Li. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545040
Abstract: Vanilla federated learning (FL) relies on a centralized global aggregation mechanism and assumes that all clients are honest, which makes it hard for FL to cope with a single point of failure and with dishonest clients. These challenges in the design philosophy of FL motivate blockchain-based federated learning (BFL), which couples FL with the benefits of blockchain (e.g., democracy, incentives, and immutability). However, vanilla BFL cannot adapt its capabilities to adopters' needs dynamically. Moreover, it relies on unverifiable self-reported client contributions, such as data size, because inspecting clients' raw data is not allowed in FL for privacy reasons. We design and evaluate FAIR-BFL, a novel BFL framework that resolves these challenges with greater flexibility and a redesigned incentive mechanism. In contrast to existing work, FAIR-BFL offers unprecedented flexibility via a modular design that lets adopters adjust its capabilities to business demands dynamically. Our design quantifies each client's contribution to the global learning process; this quantification provides a rational metric for distributing rewards among federated clients and helps discover malicious participants that may poison the global model.
FedHiSyn: A Hierarchical Synchronous Federated Learning Framework for Resource and Data Heterogeneity
Guang-Ming Li, Yue Hu, Miao Zhang, Ji Liu, Quanjun Yin, Yong Peng, D. Dou. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545065
Abstract: Federated learning (FL) enables training a global model without sharing the decentralized raw data stored on multiple devices, thereby protecting data privacy. Because device capacities differ widely, FL frameworks struggle with straggler effects and outdated models; in addition, data heterogeneity severely degrades the accuracy of the global model during training. To address these issues, we propose FedHiSyn, a hierarchical synchronous FL framework. FedHiSyn first clusters all available devices into a small number of categories based on their computing capacity. After a certain interval of local training, the models trained in different categories are uploaded to a central server simultaneously. Within a single category, devices exchange locally updated model weights along a ring topology. Because ring-based training favors devices with homogeneous resources, clustering by computing capacity mitigates the impact of stragglers. Moreover, combining the synchronous update across categories with the device-to-device communication within each category helps address data heterogeneity while achieving high accuracy. We evaluate the framework on the MNIST, EMNIST, CIFAR10, and CIFAR100 datasets under diverse heterogeneous device settings. Experimental results show that FedHiSyn outperforms six baseline methods, e.g., FedAvg, SCAFFOLD, and FedAT, in terms of training accuracy and efficiency.
{"title":"Automatic Differentiation of Parallel Loops with Formal Methods","authors":"J. Hückelheim, L. Hascoët","doi":"10.1145/3545008.3545089","DOIUrl":"https://doi.org/10.1145/3545008.3545089","url":null,"abstract":"This paper presents a novel combination of reverse mode automatic differentiation and formal methods, to enable efficient differentiation of (or backpropagation through) shared-memory parallel loops. Compared to the state of the art, our approach can reduce the need for atomic updates or private data copies during the parallel derivative computation, even in the presence of unstructured or data-dependent data access patterns. This is achieved by gathering information about the memory access patterns from the input program, which is assumed to be correctly parallelized. This information is then used to build a model of assertions in a theorem prover, which can be used to check the safety of shared memory accesses during the parallel derivative loops. We demonstrate this approach on scientific computing benchmarks including a lattice-Boltzmann method (LBM) solver from the Parboil benchmark suite and a Green’s function Monte Carlo (GFMC) kernel from the CORAL benchmark suite.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117320815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Algorithms for Masked Sparse Matrix-Matrix Products
Srđan Milaković, Oguz Selvitopi, Israt Nisa, Zoran Budimlic, A. Buluç. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545048
Abstract: Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms, as well as in bioinformatics and data analytics applications that compute inner-product similarities. For an important class of algorithms only a subset of the output entries is needed, and the resulting operation is known as Masked SpGEMM, since the remaining output entries are "masked out." Existing algorithms for Masked SpGEMM usually do not treat the mask as part of the multiplication: they either compute a regular SpGEMM followed by masking, or perform a sparse inner product only for output elements that are not masked out. In this work, we investigate novel algorithms and data structures for this challenging and important computation and provide guidelines for designing a fast Masked SpGEMM for shared-memory architectures. Our evaluations show that factors such as matrix and mask density, mask structure, and cache behavior play a vital role in attaining high performance. We evaluate our algorithms on a large number of real-world and synthetic matrices drawn from several real-world benchmarks and show that in most cases they significantly outperform state-of-the-art Masked SpGEMM implementations.
EmbRace: Accelerating Sparse Communication for Distributed Training of Deep Neural Networks
Shengwei Li, Zhiquan Lai, Dongsheng Li, Xiangyu Ye, Yabo Duan. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545011
Abstract: Distributed data-parallel training has been widely adopted for deep neural network (DNN) models. Although current deep learning (DL) frameworks scale well for dense models such as image classification models, they have relatively low scalability for sparse models such as natural language processing (NLP) models with highly sparse embedding tables. Most existing work overlooks the sparsity of model parameters and thus suffers from significant but unnecessary communication overhead. In this paper, we propose EmbRace, an efficient communication framework that accelerates the communication of distributed training for sparse models. EmbRace introduces Sparsity-aware Hybrid Communication, which integrates AlltoAll and model parallelism into data-parallel training to reduce the communication overhead of highly sparse parameters. To effectively overlap sparse communication with both backward and forward computation, EmbRace further designs a 2D Communication Scheduling approach that optimizes the model computation procedure, relaxes the dependencies of embeddings, and schedules the sparse communication of each embedding row with a priority queue. We have implemented a prototype of EmbRace based on PyTorch and Horovod and conducted comprehensive evaluations with four representative NLP models. Experimental results show that EmbRace achieves up to 2.41x speedup over state-of-the-art distributed training baselines.
{"title":"Tesseract: Parallelize the Tensor Parallelism Efficiently","authors":"Boxiang Wang, Qifan Xu, Zhengda Bian, Yang You","doi":"10.1145/3545008.3545087","DOIUrl":"https://doi.org/10.1145/3545008.3545087","url":null,"abstract":"Together with the improvements in state-of-the-art accuracies of various tasks, deep learning models are getting significantly larger. However, it is extremely difficult to implement these large models because limited GPU memory makes it impossible to fit large models into a single GPU or even a GPU server. Besides, it is highly necessary to reduce the training time for large models. Previous methods like Megatron-LM implemented a 1-Dimensional distributed method to use GPUs to speed up the training. However, these methods have a high communication overhead and a low scaling efficiency on large-scale clusters. To solve these problems, we propose Tesseract, highly scalable tensor parallelism with a novel design. It increases efficiency by reducing communication overhead and lowers the memory required for each GPU. By introducing the novel dimension into tensor parallelism, Tesseract greatly increases the memory capacity of tensor parallelism. Concretely, this new dimension furthermore increases the degree of tensor parallelism. Compared to previous 1-D and 2-D methods, Tesseract manages to reduce the communication cost on each layer, resulting in speedups of 1.38x and 1.53x respectively with strong scaling. In weak scaling experiments, Tesseract achieves a maximum of 4.0/1.7 times inference speedup and 3.4/1.7 times throughput improvement compared to 1-D/2-D methods, respectively. By introducing Tesseract, we offer a more efficient and scalable way to implement large deep learning models with limited GPU resources.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115540691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vectorizing SpMV by Exploiting Dynamic Regular Patterns
Changxi Liu, Hailong Yang, Xu Liu, Zhongzhi Luan, D. Qian. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545042
Abstract: Modern optimizing compilers can exploit memory access and computation patterns to generate vectorized code. However, in irregular programs such as SpMV these patterns are input-dependent and unknown until runtime, so neither static compiler optimization nor profile-guided optimization can capture the patterns of an arbitrary input, leading to suboptimal vectorization. To address this drawback, we propose DynVec, a framework that automatically exploits the regular patterns buried deep inside SpMV programs and applies corresponding optimizations for better vectorization. By representing instruction features and identifying regular patterns through effective feature extraction and data re-arrangement, DynVec generates highly efficient vectorized code that replaces gather/scatter/reduction operations with optimized operation groups. We evaluate DynVec on SpMV with representative sparse matrix datasets. Experimental results show that DynVec achieves significant speedups over state-of-the-art SpMV implementations across a range of platforms.
{"title":"Proceedings of the 51st International Conference on Parallel Processing","authors":"W. Nagel, W. Walter, Wolfgang Lehner","doi":"10.1145/3545008","DOIUrl":"https://doi.org/10.1145/3545008","url":null,"abstract":"","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115619851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}