2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC): Latest Publications

Message from the HiPC 2022 General Co-Chairs

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/hipc56025.2022.00005

Citations: 0

Accelerating Prefix Scan with in-network computing on Intel PIUMA

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/HiPC56025.2022.00020

Kartik Lakhotia, F. Petrini, R. Kannan, V. Prasanna

Abstract: Prefix Scan is a versatile collective used in several classes of algorithms including sorting, lexical analysis, graph analytics, and regex matching. It is also a powerful tool to perform tree operations and load balancing. However, host-based Prefix Scan implementations incur high latency, large network traffic and poor scalability on large distributed systems. We explore in-network computation to accelerate Prefix Scan, using switches with data aggregation capabilities. We discuss the fundamental challenges associated with offloading Prefix Scan onto a network, and resolve them with innovations in dataflow topology and embedding methodology. We implement the proposed approach on the Intel PIUMA system. To the best of our knowledge, this is the first realization of Prefix Scan offloading onto network switches. Our in-network Prefix Scan is highly scalable with less than 5 μs latency on 16K PIUMA nodes and 6× lower latency than the host-based Prefix Scan. The performance benefits directly translate to improved workload scalability, as we demonstrate using a key bioinformatics application called Sequence Alignment.
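
For readers unfamiliar with the collective, the following is a minimal sketch of what an inclusive prefix scan computes, together with a log-step (Hillis-Steele style) variant of the kind a host-based implementation might use. It is not the in-network design described in the abstract; the function names are hypothetical and the "nodes" are simulated as list positions.

```python
from itertools import accumulate
import operator


def inclusive_scan(values, op=operator.add):
    """Sequential reference: out[i] = values[0] op ... op values[i]."""
    return list(accumulate(values, op))


def log_step_scan(values, op=operator.add):
    """Hillis-Steele style scan, simulating one value per node.

    Runs ceil(log2(n)) rounds; in each round, node i combines its value
    with the value held by node i - dist (an assumption for illustration,
    not the paper's dataflow).
    """
    out = list(values)
    dist = 1
    while dist < len(out):
        out = [out[i] if i < dist else op(out[i - dist], out[i])
               for i in range(len(out))]
        dist *= 2
    return out
```

Both functions return the same result for an associative operator; for example, `log_step_scan([3, 1, 4, 1, 5])` yields `[3, 4, 8, 9, 14]`.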

Citations: 1

Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/HiPC56025.2022.00025

K. Suresh, Akshay Paniraja Guptha, Benjamin Michalowicz, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda

Citations: 0

Keynote 2: P Sadayappan

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/hipc56025.2022.00011

Citations: 0

Keynote 1: Paolo Ienne

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/hipc56025.2022.00010

Citations: 0

HiPC 2022 Organization

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/hipc56025.2022.00007

Citations: 0

Input Feature Pruning for Accelerating GNN Inference on Heterogeneous Platforms

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/HiPC56025.2022.00045

Jason Yik, S. Kuppannagari, Hanqing Zeng, V. Prasanna

Abstract: Graph Neural Networks (GNNs) are an emerging class of machine learning models which utilize structured graph information and node features to reduce high-dimensional input data to low-dimensional embeddings, from which predictions can be made. Due to the compounding effect of aggregating neighbor information, GNN inferences require raw data from many times more nodes than are targeted for prediction. Thus, on heterogeneous compute platforms, inference latency can be largely subject to the inter-device communication cost of transferring input feature data to the GPU/accelerator before computation has even begun. In this paper, we analyze the trade-off effect of pruning input features from GNN models, reducing the volume of raw data that the model works with to lower communication latency at the expense of an expected decrease in the overall model accuracy. We develop greedy and regression-based algorithms to determine which features to retain for optimal prediction accuracy. We evaluate pruned model variants and find that they can reduce inference latency by up to 80% with an accuracy loss of less than 5% compared to non-pruned models. Furthermore, we show that the latency reductions from input feature pruning can be extended under different system variables such as batch size and floating point precision.
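
The abstract does not spell out the greedy algorithm, so the following is only a generic sketch of greedy forward feature selection under a fixed feature budget. The function name, the `score` callback, and the toy weights are all hypothetical; in practice `score` would stand in for something like the validation accuracy of a model restricted to the chosen features.

```python
def greedy_select(num_features, budget, score):
    """Pick up to `budget` feature indices, one at a time.

    At each step, add the feature whose inclusion gives the highest
    score(selected + [feature]). `score` is a caller-supplied callable
    (an assumption for illustration).
    """
    selected = []
    remaining = set(range(num_features))
    while remaining and len(selected) < budget:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy usage: each feature contributes an independent, made-up weight.
weights = [0.1, 0.9, 0.4, 0.7]
chosen = greedy_select(4, 2, lambda feats: sum(weights[f] for f in feats))
```

With these toy weights, `chosen` is `[1, 3]`: only two of the four feature columns would then need to be transferred to the accelerator.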

Citations: 0

COMPROF and COMPLACE: Shared-Memory Communication Profiling and Automated Thread Placement via Dynamic Binary Instrumentation

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/HiPC56025.2022.00040

Ryan Kirkpatrick, Christopher Brown, Vladimir Janjic

Abstract: This paper presents COMPROF and COMPLACE, a novel profiling tool and thread placement technique for shared-memory architectures that requires no recompilation or user intervention. We use dynamic binary instrumentation to intercept memory operations and estimate inter-thread communication overhead, deriving (and possibly visualising) a communication graph of data-sharing between threads. We then use this graph to map threads to cores in order to optimise memory traffic through the memory system. Different paths through a system's memory hierarchy have different latency, throughput and energy properties; COMPLACE exploits this heterogeneity to provide automatic performance and energy improvements for multithreaded programs. We demonstrate COMPLACE on the NAS Parallel Benchmark (NPB) suite where, using our technique, we are able to achieve improvements of up to 12% in the execution time and up to 10% in the energy consumption (compared to default Linux scheduling) while not requiring any modification or recompilation of the application code.
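
To illustrate how a communication graph can drive placement, here is a minimal sketch that greedily pairs the threads that share the most data, on the assumption that each pair would be mapped to two cores sharing a cache. This is a simple heuristic chosen for illustration, not the COMPLACE algorithm; the function name and the input format are hypothetical.

```python
def pair_threads(comm, num_threads):
    """Greedily pair threads by communication volume.

    comm: dict mapping (a, b) with a < b to an estimated communication
    weight between threads a and b (hypothetical input format).
    Returns a list of tuples; each tuple is a group of threads intended
    for cores that share a cache level.
    """
    placed = set()
    pairs = []
    # Visit edges from heaviest to lightest; ties broken by thread ids.
    for (a, b), _ in sorted(comm.items(), key=lambda kv: (-kv[1], kv[0])):
        if a not in placed and b not in placed:
            pairs.append((a, b))
            placed.update((a, b))
    # Threads with no remaining partner are grouped in index order.
    leftover = [t for t in range(num_threads) if t not in placed]
    pairs.extend(tuple(leftover[i:i + 2]) for i in range(0, len(leftover), 2))
    return pairs


comm_graph = {(0, 1): 10, (0, 2): 50, (1, 3): 40, (2, 3): 5}
placement = pair_threads(comm_graph, 4)
```

Here `placement` is `[(0, 2), (1, 3)]`, keeping the two heaviest edges inside a shared cache. On Linux, a group could then be pinned with `os.sched_setaffinity`, although the abstract does not say how COMPLACE applies its mapping.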

Citations: 0

Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/HiPC56025.2022.00016

Qinghua Zhou, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda

Abstract: With rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes to run distributed training. Large-message communication of GPU data between the GPUs is becoming a bottleneck in overall training performance. GPU-Aware MPI libraries are widely adopted by state-of-the-art DL frameworks to improve communication performance. In the existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often utilized to sync up the updated model parameters among all the GPUs. However, for state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data tends to degrade training performance due to the limited bandwidth of the interconnect between the GPU nodes. On the other hand, recent research on using GPU-based compression libraries to lower the pressure on the nearly saturated interconnect, and on co-designing online compression with the communication pattern, provides a new perspective for optimizing the performance of broadcast on modern GPU clusters. In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large message broadcast communication. The proposed design is evaluated to show benefits at both microbenchmark and application levels. At the microbenchmark level, the proposed design can reduce the broadcast communication latency by up to 80.9% compared to the baseline using a state-of-the-art MPI library and 55.1% compared to the existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, the proposed design reduces the training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while keeping similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.
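
The idea of a chunked-chain broadcast with collective-level compression can be sketched as a sequential simulation: the root splits the message into chunks and compresses each chunk once, and every rank along the chain forwards the compressed chunk and decompresses a local copy. This sketch makes several assumptions that are not from the paper: `zlib` on the CPU stands in for a GPU compression library, the function name and parameters are hypothetical, and the pipelined overlap between hops that a real implementation would rely on is not modeled.

```python
import zlib


def chunked_chain_broadcast(message: bytes, num_ranks: int, chunk_size: int):
    """Simulate broadcasting `message` from rank 0 down a chain of ranks.

    Returns (per-rank received data, total compressed bytes sent over
    all hops). zlib is a stand-in for GPU-based compression.
    """
    chunks = [message[i:i + chunk_size]
              for i in range(0, len(message), chunk_size)]
    # Compress once at the root; later hops forward the compressed form.
    compressed = [zlib.compress(c) for c in chunks]

    received = [bytearray() for _ in range(num_ranks)]
    received[0] += message
    wire_bytes = 0
    for c in compressed:
        for rank in range(1, num_ranks):
            wire_bytes += len(c)                  # hop: rank - 1 -> rank
            received[rank] += zlib.decompress(c)  # local decompressed copy
    return [bytes(r) for r in received], wire_bytes


payload = b"abc" * 1000
outputs, sent = chunked_chain_broadcast(payload, num_ranks=4, chunk_size=512)
```

Every entry in `outputs` equals `payload`, and for compressible data `sent` is smaller than the `3 * len(payload)` bytes that three uncompressed hops would carry.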

Citations: 0

Keynote 4: Per Stenström

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC) Pub Date : 2022-12-01 DOI: 10.1109/hipc56025.2022.00013

Citations: 0