FAIR-BFL: Flexible and Incentive Redesign for Blockchain-based Federated Learning
Rongxin Xu, Shiva Raj Pokhrel, Qiujun Lan, Gang Li. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545040
Abstract: Vanilla federated learning (FL) relies on a centralized global aggregation mechanism and assumes that all clients are honest, which makes it hard for FL to cope with a single point of failure and with dishonest clients. These challenges in the design philosophy of FL motivate blockchain-based federated learning (BFL), which couples FL with the benefits of blockchain (e.g., democracy, incentives, and immutability). However, vanilla BFL cannot adapt its capabilities to adopters' needs dynamically. Moreover, it relies on unverifiable self-reported client contributions, such as data size, because inspecting clients' raw data is not allowed in FL for privacy reasons. We design and evaluate FAIR-BFL, a novel BFL framework that resolves these challenges with greater flexibility and a redesigned incentive mechanism. In contrast to existing work, FAIR-BFL offers unprecedented flexibility via a modular design that lets adopters adjust its capabilities to business demands dynamically. Our design quantifies each client's contribution to the global learning process; this quantification provides a rational metric for distributing rewards among federated clients and helps discover malicious participants that may poison the global model.
FedHiSyn: A Hierarchical Synchronous Federated Learning Framework for Resource and Data Heterogeneity
Guang-Ming Li, Yue Hu, Miao Zhang, Ji Liu, Quanjun Yin, Yong Peng, D. Dou. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545065
Abstract: Federated learning (FL) enables training a global model without sharing the decentralized raw data stored on multiple devices, thereby protecting data privacy. Because device capacities differ widely, FL frameworks struggle with straggler effects and outdated models; in addition, data heterogeneity severely degrades the accuracy of the global model during training. To address these issues, we propose FedHiSyn, a hierarchical synchronous FL framework. FedHiSyn first clusters all available devices into a small number of categories based on their computing capacity. After a certain interval of local training, the models trained in different categories are uploaded to a central server simultaneously. Within a single category, devices exchange locally updated model weights along a ring topology. Because ring-based training favors devices with homogeneous resources, clustering by computing capacity mitigates the impact of stragglers. Moreover, combining the synchronous update across categories with the device-to-device communication within each category helps address data heterogeneity while achieving high accuracy. We evaluate the framework on the MNIST, EMNIST, CIFAR10, and CIFAR100 datasets under diverse heterogeneous device settings. Experimental results show that FedHiSyn outperforms six baseline methods, e.g., FedAvg, SCAFFOLD, and FedAT, in terms of training accuracy and efficiency.
{"title":"Automatic Differentiation of Parallel Loops with Formal Methods","authors":"J. Hückelheim, L. Hascoët","doi":"10.1145/3545008.3545089","DOIUrl":"https://doi.org/10.1145/3545008.3545089","url":null,"abstract":"This paper presents a novel combination of reverse mode automatic differentiation and formal methods, to enable efficient differentiation of (or backpropagation through) shared-memory parallel loops. Compared to the state of the art, our approach can reduce the need for atomic updates or private data copies during the parallel derivative computation, even in the presence of unstructured or data-dependent data access patterns. This is achieved by gathering information about the memory access patterns from the input program, which is assumed to be correctly parallelized. This information is then used to build a model of assertions in a theorem prover, which can be used to check the safety of shared memory accesses during the parallel derivative loops. We demonstrate this approach on scientific computing benchmarks including a lattice-Boltzmann method (LBM) solver from the Parboil benchmark suite and a Green’s function Monte Carlo (GFMC) kernel from the CORAL benchmark suite.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117320815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Algorithms for Masked Sparse Matrix-Matrix Products
Srđan Milaković, Oguz Selvitopi, Israt Nisa, Zoran Budimlic, A. Buluç. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545048
Abstract: Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms, as well as in bioinformatics and data analytics applications that compute inner-product similarities. For an important class of algorithms only a subset of the output entries is needed, and the resulting operation is known as Masked SpGEMM, since the remaining output entries are "masked out." Existing algorithms for Masked SpGEMM usually do not treat the mask as part of the multiplication: they either compute a regular SpGEMM followed by masking, or perform a sparse inner product only for output elements that are not masked out. In this work, we investigate novel algorithms and data structures for this challenging and important computation and provide guidelines for designing a fast Masked SpGEMM for shared-memory architectures. Our evaluations show that factors such as matrix and mask density, mask structure, and cache behavior play a vital role in attaining high performance. We evaluate our algorithms on a large number of real-world and synthetic matrices drawn from several real-world benchmarks and show that in most cases they significantly outperform state-of-the-art Masked SpGEMM implementations.
EmbRace: Accelerating Sparse Communication for Distributed Training of Deep Neural Networks
Shengwei Li, Zhiquan Lai, Dongsheng Li, Xiangyu Ye, Yabo Duan. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545011
Abstract: Distributed data-parallel training has been widely adopted for deep neural network (DNN) models. Although current deep learning (DL) frameworks scale well for dense models such as image classification models, they have relatively low scalability for sparse models such as natural language processing (NLP) models with highly sparse embedding tables. Most existing work overlooks the sparsity of model parameters and thus suffers from significant but unnecessary communication overhead. In this paper, we propose EmbRace, an efficient communication framework that accelerates the communication of distributed training for sparse models. EmbRace introduces Sparsity-aware Hybrid Communication, which integrates AlltoAll and model parallelism into data-parallel training to reduce the communication overhead of highly sparse parameters. To effectively overlap sparse communication with both backward and forward computation, EmbRace further designs a 2D Communication Scheduling approach that optimizes the model computation procedure, relaxes the dependencies of embeddings, and schedules the sparse communication of each embedding row with a priority queue. We have implemented a prototype of EmbRace based on PyTorch and Horovod and conducted comprehensive evaluations with four representative NLP models. Experimental results show that EmbRace achieves up to 2.41x speedup over state-of-the-art distributed training baselines.
{"title":"Tesseract: Parallelize the Tensor Parallelism Efficiently","authors":"Boxiang Wang, Qifan Xu, Zhengda Bian, Yang You","doi":"10.1145/3545008.3545087","DOIUrl":"https://doi.org/10.1145/3545008.3545087","url":null,"abstract":"Together with the improvements in state-of-the-art accuracies of various tasks, deep learning models are getting significantly larger. However, it is extremely difficult to implement these large models because limited GPU memory makes it impossible to fit large models into a single GPU or even a GPU server. Besides, it is highly necessary to reduce the training time for large models. Previous methods like Megatron-LM implemented a 1-Dimensional distributed method to use GPUs to speed up the training. However, these methods have a high communication overhead and a low scaling efficiency on large-scale clusters. To solve these problems, we propose Tesseract, highly scalable tensor parallelism with a novel design. It increases efficiency by reducing communication overhead and lowers the memory required for each GPU. By introducing the novel dimension into tensor parallelism, Tesseract greatly increases the memory capacity of tensor parallelism. Concretely, this new dimension furthermore increases the degree of tensor parallelism. Compared to previous 1-D and 2-D methods, Tesseract manages to reduce the communication cost on each layer, resulting in speedups of 1.38x and 1.53x respectively with strong scaling. In weak scaling experiments, Tesseract achieves a maximum of 4.0/1.7 times inference speedup and 3.4/1.7 times throughput improvement compared to 1-D/2-D methods, respectively. By introducing Tesseract, we offer a more efficient and scalable way to implement large deep learning models with limited GPU resources.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115540691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vectorizing SpMV by Exploiting Dynamic Regular Patterns
Changxi Liu, Hailong Yang, Xu Liu, Zhongzhi Luan, D. Qian. Proceedings of the 51st International Conference on Parallel Processing (ICPP 2022). DOI: https://doi.org/10.1145/3545008.3545042
Abstract: Modern optimizing compilers can exploit memory access and computation patterns to generate vectorized code. However, in irregular programs such as SpMV these patterns are input-dependent and unknown until runtime, so neither static compiler optimization nor profile-guided optimization can capture the patterns of an arbitrary input, leading to suboptimal vectorization. To address this drawback, we propose DynVec, a framework that automatically exploits the regular patterns buried deep inside SpMV programs and applies corresponding optimizations for better vectorization. By representing instruction features and identifying regular patterns through effective feature extraction and data re-arrangement, DynVec generates highly efficient vectorized code that replaces gather/scatter/reduction operations with optimized operation groups. We evaluate DynVec on SpMV with representative sparse matrix datasets. Experimental results show that DynVec achieves significant speedups over state-of-the-art SpMV implementations across a range of platforms.
{"title":"Proceedings of the 51st International Conference on Parallel Processing","authors":"W. Nagel, W. Walter, Wolfgang Lehner","doi":"10.1145/3545008","DOIUrl":"https://doi.org/10.1145/3545008","url":null,"abstract":"","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115619851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}