Proceedings of the 51st International Conference on Parallel Processing: Latest Publications

ParaGraph: An application-simulator interface and toolkit for hardware-software co-design
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545069
Mikhail Isaev, Nic McDonald, Jeffrey S. Young, R. Vuduc
ParaGraph is an open-source toolkit for use in co-designing hardware and software for supercomputer-scale systems. It bridges an infrastructure gap between an application target and existing high-fidelity computer-network simulators. The first component of ParaGraph is a high-level graph representation of a parallel program, which (a) faithfully represents parallelism and communication, (b) can be extracted automatically from a compiler, and (c) is "tuned" for use with network simulators. The second is a runtime that can emulate the representation's dynamic execution for a simulator. User-extensible mechanisms are available for modeling on-node performance and for transforming high-level communication into operations that backend simulators understand. Case studies include deep learning workloads that are extracted automatically from programs written in JAX and TensorFlow and interfaced with several event-driven network simulators. These studies show how system designers can use ParaGraph to build flexible end-to-end software-hardware co-design workflows to tweak communication libraries, find future hardware bottlenecks, and validate simulations with traces.
Citations: 1
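The abstract describes two components: a communication-aware graph representation and a runtime that emulates its dynamic execution. A toy dependency-graph emulator can illustrate the idea; the `Node` fields, fixed per-node costs, and unlimited-parallelism assumption are all hypothetical simplifications, not ParaGraph's actual representation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of a hypothetical program graph: compute or communication."""
    name: str
    kind: str        # "compute" or "comm"
    cost: float      # modeled time in seconds (a real tool would model this)
    deps: list = field(default_factory=list)   # names of predecessor nodes

def emulate(nodes):
    """Walk the graph and return each node's finish time, assuming a node
    starts as soon as all of its dependencies have finished."""
    graph = {n.name: n for n in nodes}
    finish = {}
    def fin(name):
        if name not in finish:
            n = graph[name]
            start = max((fin(d) for d in n.deps), default=0.0)
            finish[name] = start + n.cost
        return finish[name]
    for n in nodes:
        fin(n.name)
    return finish

prog = [
    Node("matmul", "compute", 2.0),
    Node("allreduce", "comm", 1.0, deps=["matmul"]),
    Node("update", "compute", 0.5, deps=["allreduce"]),
]
print(emulate(prog))  # {'matmul': 2.0, 'allreduce': 3.0, 'update': 3.5}
```

A network-simulator backend would replace the fixed `cost` of communication nodes with simulated message latencies.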
FedClassAvg: Local Representation Learning for Personalized Federated Learning on Heterogeneous Neural Networks
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545073
Jaehee Jang, Heonseok Ha, Dahuin Jung, Sungroh Yoon
Personalized federated learning aims to allow numerous clients to train personalized models while participating in collaborative training in a communication-efficient manner, without exchanging private data. However, many personalized federated learning algorithms assume that clients have the same neural network architecture, and algorithms for heterogeneous models remain understudied. In this study, we propose a novel personalized federated learning method called federated classifier averaging (FedClassAvg). Deep neural networks for supervised learning tasks consist of feature-extractor and classifier layers. FedClassAvg aggregates classifier weights as an agreement on decision boundaries in feature space, so that clients with non-independently-and-identically-distributed (non-iid) data can learn about scarce labels. In addition, local feature representation learning is applied to stabilize the decision boundaries and improve the local feature-extraction capabilities of clients. While existing methods require collecting auxiliary data or model weights to generate a counterpart, FedClassAvg only requires clients to communicate a couple of fully connected layers, which is highly communication-efficient. Moreover, FedClassAvg does not require solving extra optimization problems such as knowledge transfer, which incur intensive computational overhead. We evaluated FedClassAvg through extensive experiments and demonstrated that it outperforms the current state-of-the-art algorithms on heterogeneous personalized federated learning tasks.
Citations: 6
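The central aggregation step, averaging only the classifier layers (whose shapes can agree even when the clients' feature extractors differ), can be sketched as below. The helper name and the two-tensor classifier are illustrative assumptions, not the paper's code, and the real method additionally performs local representation learning.

```python
import numpy as np

def average_classifiers(client_classifiers):
    """FedClassAvg-style aggregation sketch: clients keep heterogeneous
    feature extractors local and exchange only their final fully
    connected layers, which are averaged elementwise and broadcast back."""
    n = len(client_classifiers)
    # zip pairs up corresponding layer tensors across clients
    return [sum(layers) / n for layers in zip(*client_classifiers)]

# two clients, each classifier = [W (2x2 weights), b (2-vector bias)]
c1 = [np.array([[1., 0.], [0., 1.]]), np.array([0., 0.])]
c2 = [np.array([[3., 0.], [0., 3.]]), np.array([2., 2.])]
avg = average_classifiers([c1, c2])
print(avg[0])
print(avg[1])
```

Only these small tensors cross the network each round, which is the source of the communication efficiency the abstract claims.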
ParallelDualSPHysics: supporting efficient parallel fluid simulations through MPI-enabled SPH method
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545016
Sifan Long, Xiao-Wei Guo, Xiaokang Fan, Chao Li, Kelvin Wong, Ran Zhao, Yi Liu, Sen Zhang, Canqun Yang
Smoothed Particle Hydrodynamics (SPH) is a classical mesh-free particle method that has been successfully applied in the field of Computational Fluid Dynamics (CFD). Its advantages over traditional mesh-based methods have made it very popular for simulating problems involving large deformation and free-surface flow. However, the high computational cost of the SPH method has hindered its wider application. Considerable research effort has been devoted to accelerating the SPH method using GPUs and multithreading. Nevertheless, developing efficient parallel SPH algorithms on modern high-performance computers (HPCs) remains significantly challenging, especially for simulating real-world engineering problems involving hundreds of millions of particles. In this paper, we propose an MPI-enabled parallel SPH algorithm and develop ParallelDualSPHysics, an open-source software package supporting efficient parallel fluid simulations. Based on an efficient domain decomposition scheme, the essential data structures and algorithms of DualSPHysics were refactored to build the parallel version. To operate on particles distributed evenly across a distributed-memory HPC system, parallel particle-interaction and particle-update modules were introduced, enabling the SPH solver to synchronize computations among multiple processes using MPI. In addition, the redesigned pre-processing and post-processing capabilities of ParallelDualSPHysics support its application in a wide range of areas. Real-life test cases with up to 120 million particles were simulated and analyzed on a modern HPC system. The results show that the parallel efficiency of ParallelDualSPHysics exceeds 90% on up to 1,024 CPU cores, indicating that ParallelDualSPHysics has the potential for large-scale engineering applications.
Citations: 0
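A toy one-dimensional slab decomposition shows the essence of the domain decomposition an MPI-parallel SPH solver needs: each rank owns a slab of particles and must also receive a halo of foreign particles within one smoothing length of its boundaries. The function and its parameters are hypothetical; the paper's decomposition scheme is considerably more elaborate.

```python
import numpy as np

def decompose_1d(xs, nranks, h):
    """Split particles at positions xs into nranks equal-width slabs.
    Each rank owns the particles inside its slab and needs a halo of
    foreign particles within smoothing length h of its slab boundaries."""
    lo, hi = float(xs.min()), float(xs.max())
    width = (hi - lo) / nranks
    owned, halo = [], []
    for r in range(nranks):
        a, b = lo + r * width, lo + (r + 1) * width
        if r == nranks - 1:
            b = hi + 1e-12                     # close the last slab on the right
        mine = (xs >= a) & (xs < b)
        near = (xs >= a - h) & (xs < b + h)
        owned.append(np.flatnonzero(mine))
        halo.append(np.flatnonzero(near & ~mine))   # foreign neighbors only
    return owned, halo

xs = np.array([0.1, 0.4, 0.55, 0.9])
owned, halo = decompose_1d(xs, nranks=2, h=0.15)
print([list(o) for o in owned])  # [[0, 1], [2, 3]]
print([list(g) for g in halo])   # [[2], [1]]
```

In an MPI implementation, each rank would exchange exactly its halo particles with neighboring ranks before every particle-interaction phase.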
Lobster: Load Balance-Aware I/O for Distributed DNN Training
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545090
Jie Liu, Bogdan Nicolae, Dong Li
The resource-hungry and time-consuming process of training Deep Neural Networks (DNNs) can be accelerated by optimizing and/or scaling computations on accelerators such as GPUs. However, the loading and pre-processing of training samples then often emerges as a new bottleneck. This data-loading process engages a complex pipeline that extends from the sampling of training data on external storage to the delivery of those data to GPUs, and that comprises not only expensive I/O operations but also decoding, shuffling, batching, augmentation, and other operations. In this paper we propose a new holistic approach to data loading that addresses three challenges not sufficiently addressed by other methods: I/O load imbalances among the GPUs on a node; rigid resource allocations to the data-loading and data-preprocessing steps, which lead to idle resources and bottlenecks; and the limited efficiency of prefetch-based caching strategies, which evict training samples needed soon in favor of those needed later. We first present a study of the key bottlenecks observed as training samples flow through the data-loading and preprocessing pipeline. Then, we describe Lobster, a data-loading runtime that uses performance modeling and advanced heuristics to combine flexible thread management with optimized eviction for distributed caching, in order to mitigate I/O overheads and load imbalances. Experiments with a range of models and datasets show that the Lobster approach reduces both I/O overheads and end-to-end training times by up to 1.5× compared with state-of-the-art approaches.
Citations: 5
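The abstract's third challenge (prefetch-based caches evicting samples needed soon in favor of those needed later) has a natural remedy in DNN training, because the shuffled sample order for an epoch is known in advance. A minimal, Belady-style sketch of that idea, not Lobster's actual policy:

```python
def evict_farthest(cache, future):
    """Pick the eviction victim whose next use is farthest in the future.
    `future` is the known upcoming sample order for this epoch, so the
    cache never evicts a sample that is needed soon."""
    def next_use(sample):
        try:
            return future.index(sample)
        except ValueError:
            return float("inf")          # never used again: ideal victim
    return max(cache, key=next_use)

cache = {"a", "b", "c"}
future = ["b", "c", "a", "b"]            # upcoming sample ids this epoch
print(evict_farthest(cache, future))     # 'a' (needed latest)
```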
NCC: Neighbor-aware Congestion Control based on Reinforcement Learning for Datacenter Networks
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545074
Haoyu Wang, Kevin Zheng, Charles Reiss, Haiying Shen
The challenges of low-latency, high-throughput datacenter networks create new traffic-management problems that require new congestion control mechanisms. Generally, proposals to solve this problem have focused either on refining existing window-based congestion control, as in TCP, or on introducing a central controller to make congestion control decisions. In this paper, we propose a third approach, in which nodes share network information with their neighbors and apply this information to make local decisions that limit global congestion. In our implementation, the rate-limiting decisions on each node are driven by a local agent that uses reinforcement learning to optimize a combination of overall latency, throughput, and the shared information. To make this approach efficient, the local agents choose overall rate limits for each node, and a separate process then assigns the traffic of individual flows within these limits. We show that, in a trace-driven real implementation, our method achieves better congestion avoidance than several end-to-end and centralized mechanisms from prior work.
Citations: 1
Themis: Fair Memory Subsystem Resource Sharing with Differentiated QoS in Public Clouds
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545064
Wenda Tang, Senbo Fu, Y. Ke, Qian Peng, Feng Gao
To reduce the increasing cost of building and operating cloud data centers, cloud providers are seeking various mechanisms to achieve higher resource effectiveness. For example, cloud operators leverage dynamic resource-management techniques to consolidate a higher density of application workloads onto commodity physical servers to maximize server resource utilization. However, higher workload density is a major source of performance-interference problems in multi-tenant clouds. Existing performance-isolation techniques, such as dedicating CPU cores to specific workloads, are not enough, as there are still common resources on the processor (e.g., the last-level cache and memory bandwidth in the memory subsystem) that are shared among all CPUs on the same NUMA node. While prior work has proposed a variety of resource-partitioning techniques, it remains unexplored to characterize the impact of memory-subsystem resource partitioning on consolidated workloads with different priorities, and to investigate software support for dynamically managing memory-subsystem resource sharing in real time. To bridge this gap, we propose Themis, a feedback-based controller that enables a priority-aware and fairness-aware memory-subsystem resource-management strategy to guarantee the performance of high-priority workloads while maintaining fairness across all colocated workloads in high-density clouds. Themis is evaluated with multiple typical cloud applications in our data center environment. The results show that Themis improves the performance of various workloads by up to 3.15%, and improves fairness in memory-subsystem resource allocation by more than 70%, compared to existing state-of-the-art work.
Citations: 0
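A feedback-based controller of the kind the abstract names can be sketched as a single proportional step: throttle the low-priority tenants' share of a memory-subsystem resource when the high-priority workload misses its QoS target, and relax it otherwise. The gain, bounds, and knob here are assumptions for illustration; the paper's controller and the hardware controls it drives are its own.

```python
def feedback_step(share, latency, target, gain=0.05, lo=0.1, hi=1.0):
    """One proportional-control iteration: shrink the low-priority
    resource share in proportion to how far the high-priority
    workload's latency exceeds its target, clamped to [lo, hi]."""
    error = (latency - target) / target     # positive: QoS target missed
    share = share - gain * error * share
    return min(hi, max(lo, share))

share = 1.0
share = feedback_step(share, latency=1.2, target=1.0)  # 20% over target
print(round(share, 3))  # 0.99
```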
Micro-Benchmarking MPI Partitioned Point-to-Point Communication
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545088
Yiltan Hassan Temuçin, Ryan E. Grant, A. Afsahi
Modern High-Performance Computing (HPC) architectures have created the need for scalable hybrid programming models. The latest Message Passing Interface (MPI) 4.0 standard has introduced a new communication model: MPI partitioned point-to-point communication. This new model allows data to be contributed from multiple threads with lower overheads than traditional MPI point-to-point communication. In this paper, we design the first publicly available micro-benchmark suite for MPI partitioned communication, measuring various metrics that give insight into the benefits of the new model and into the scenarios where MPI point-to-point is better suited. We provide suggestions to application developers on how to choose a partition size for their application based on compute and message size. We evaluate MPI partitioned communication with both hot and cold CPU caches, under system noise with different probability distributions, for point-to-point communication directly, and within commonly used MPI communication patterns such as a halo exchange and Sweep3D.
Citations: 4
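The partitioned model lets multiple threads contribute pieces of one message, each flagged ready independently (via MPI_Psend_init / MPI_Pready / MPI_Parrived in the C API). Those calls are not modeled here; this pure-Python thread sketch only mimics the idea of per-partition readiness.

```python
import threading

def partitioned_send(buf, nparts, work):
    """Each thread fills one partition of a single message buffer and
    flags it ready, so transmission of early partitions could begin
    before the whole message is complete. A conceptual sketch only;
    real MPI partitioned communication uses persistent requests."""
    n = len(buf) // nparts
    ready = [threading.Event() for _ in range(nparts)]

    def fill(p):
        buf[p * n:(p + 1) * n] = work(p, n)  # thread p contributes its slice
        ready[p].set()                       # analogous to MPI_Pready(p, req)

    threads = [threading.Thread(target=fill, args=(p,)) for p in range(nparts)]
    for t in threads:
        t.start()
    for e in ready:                          # receiver side: MPI_Parrived-like
        e.wait()
    for t in threads:
        t.join()
    return buf

out = partitioned_send([0] * 8, nparts=4, work=lambda p, n: [p] * n)
print(out)  # [0, 0, 1, 1, 2, 2, 3, 3]
```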
Atos: A Task-Parallel GPU Scheduler for Graph Analytics
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545056
Yuxin Chen, Benjamin Brock, Serban D. Porumbescu, A. Buluç, K. Yelick, J. Owens
We present Atos, a task-parallel GPU dynamic-scheduling framework that is especially targeted at dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems with concurrency bottlenecks. Atos also offers implicit task-parallel load balancing in addition to data-parallel load balancing, giving users the flexibility to balance between the two to achieve optimal performance. Finally, Atos allows users to adapt to different use cases by controlling the kernel strategy and the task-parallel granularity; we demonstrate that each of these controls is important in practice. We evaluate and analyze the performance of Atos versus BSP on three applications: breadth-first search, PageRank, and graph coloring. Atos implementations achieve geomean speedups of 3.44×, 2.1×, and 2.77×, and peak speedups of 12.8×, 3.2×, and 9.08×, across the three case studies, compared to a state-of-the-art BSP GPU implementation. Beyond simply quantifying the speedup, we extensively analyze the reasons behind each speedup. This deeper understanding allows us to derive general guidelines for selecting the optimal Atos configuration for different applications. Finally, our analysis provides insights for future dynamic-scheduling framework designs.
Citations: 4
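The difference from BSP can be seen even in serial Python: a worklist-driven BFS with relaxed dependencies has no per-level barrier and may relabel a vertex more than once, yet it converges to the same depths a level-synchronous BFS would produce. This is a sketch of the task-parallel formulation, not of Atos's GPU scheduler.

```python
from collections import deque

def relaxed_bfs(adj, src):
    """Worklist BFS with relaxed dependencies: relabel a vertex whenever
    a shorter depth is found and re-enqueue it, with no level barrier.
    On a GPU, many workers would drain this queue concurrently."""
    depth = {v: float("inf") for v in adj}
    depth[src] = 0
    work = deque([src])
    while work:
        u = work.popleft()
        for v in adj[u]:
            if depth[u] + 1 < depth[v]:
                depth[v] = depth[u] + 1
                work.append(v)        # relaxed dependency: no barrier needed
    return depth

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(relaxed_bfs(adj, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```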
Accelerating Random Forest Classification on GPU and FPGA
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545067
Milan Shah, Reece Neff, Hancheng Wu, Marco Minutoli, Antonino Tumeo, M. Becchi
Random Forests (RFs) are a commonly used machine-learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving the performance of RF training, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. To provide efficient support for large datasets, we propose a hierarchical memory layout suited to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers several aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine-learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that, while the GPU reports the best performance, our code variants outperform the CSR baseline on both GPU and FPGA. For high-accuracy targets, our GPU implementation yields a 5-9× speedup over CSR, and up to a 2× speedup over Nvidia's cuML library.
Citations: 0
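Fast RF inference on accelerators typically stores trees as flat, index-linked arrays rather than pointer-chasing node objects. The layout below is a generic sketch of that idea; the leaf encoding and field names are assumptions, and the paper's hierarchical layout is more involved.

```python
import numpy as np

def predict(tree, x):
    """Iterative inference over a decision tree stored as flat arrays.
    Internal nodes branch on feature/threshold; leaves are marked with
    feature == -1 and reuse the 'left' slot as the class label."""
    i = 0
    while tree["feature"][i] != -1:
        if x[tree["feature"][i]] <= tree["threshold"][i]:
            i = tree["left"][i]
        else:
            i = tree["right"][i]
    return tree["left"][i]            # leaf: 'left' reused as class label

# root splits on feature 0 at 0.5; both children are leaves (classes 0, 1)
tree = {
    "feature":   np.array([0, -1, -1]),
    "threshold": np.array([0.5, 0.0, 0.0]),
    "left":      np.array([1, 0, 1]),
    "right":     np.array([2, 0, 0]),
}
print(predict(tree, [0.3]), predict(tree, [0.9]))  # 0 1
```

Because traversal touches contiguous arrays rather than scattered heap nodes, this style of layout maps naturally onto GPU global memory and FPGA block RAM.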
DC4: Reconstructing Data-Credit-Coupled Congestion Control for Data Centers
Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-08-29 DOI: 10.1145/3545008.3545023
Shan Huang, Dezun Dong, Lingbin Zeng, Zejia Zhou, Yukun Zhou, Xiangke Liao
Congestion control is crucial for the overall performance of data center networks and still faces considerable challenges. Recently, credit-driven congestion control has emerged as a way to enable precise flow control in today's high-speed and highly dynamic data centers. However, existing credit-driven methods essentially separate credit and data packets: credits can fully regulate data packets, but they receive little feedback from the data packets. Accordingly, these approaches inevitably struggle with lossy credits and impaired throughput. To address this issue, we present DC4, a data-credit-coupled congestion control scheme. For a better understanding of the relationship between data and credit, we revisit the principles of credit-based congestion control and make the first attempt to present a data-credit plane architecture. Based on this framework, DC4 transforms the interaction between credit and data packets from one-way control into two-way coordination, achieving mutual benefits and a dynamic balance between credit and data packets. We conduct extensive experiments to evaluate the performance of our design and compare it with state-of-the-art protocols, including HPCC, ExpressPass, and Aeolus. Experimental results show that DC4 outperforms data-credit-separated approaches in terms of flow completion time, throughput, and credit waste.
Citations: 0