{"title":"Message from the HiPC 2022 General Co-Chairs","authors":"","doi":"10.1109/hipc56025.2022.00005","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00005","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115286608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kartik Lakhotia, F. Petrini, R. Kannan, V. Prasanna
{"title":"Accelerating Prefix Scan with in-network computing on Intel PIUMA","authors":"Kartik Lakhotia, F. Petrini, R. Kannan, V. Prasanna","doi":"10.1109/HiPC56025.2022.00020","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00020","url":null,"abstract":"Prefix Scan is a versatile collective used in several classes of algorithms including sorting, lexical analysis, graph analytics, and regex matching. It is also a powerful tool to perform tree operations and load balancing. However, host-based Prefix Scan implementations incur high latency, large network traffic and poor scalability on large distributed systems.We explore in-network computation to accelerate Prefix Scan, using switches with data aggregation capabilities. We discuss the fundamental challenges associated with offloading Prefix Scan onto a network, and resolve them with innovations in dataflow topology and embedding methodology. We implement the proposed approach on the Intel PIUMA system. To the best of our knowledge, this is the first realization of a Prefix Scan offloading onto network switches.Our in-network Prefix Scan is highly scalable with less than 5μs latency on 16K PIUMA nodes and 6× lower latency than the host-based Prefix Scan. The performance benefits directly translate to improved workload scalability, as we demonstrate using a key bioinformatics application called Sequence Alignment.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116890030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Suresh, Akshay Paniraja Guptha, Benjamin Michalowicz, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda
{"title":"Efficient Personalized and Non-Personalized Alltoall Communication for Modern Multi-HCA GPU-Based Clusters","authors":"K. Suresh, Akshay Paniraja Guptha, Benjamin Michalowicz, B. Ramesh, M. Abduljabbar, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/HiPC56025.2022.00025","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00025","url":null,"abstract":"Graphics Processing Units (GPUs) have become ubiquitous in today’s supercomputing clusters primarily because of their high compute capability and power efficiency. Message Passing Interface (MPI) is a widely adopted programming model for large-scale GPU-based applications used in such clusters. Modern GPU-based systems have multiple HCAs. Previously, scientists have leveraged multi-HCA systems to accelerate inter-node transfers between CPUs using point-to-point primitives. In this work, we show the need for collective-level, multi-rail aware algorithms using MPI_Allgather as an example. We then propose an efficient multi-rail MPI_Allgather algorithm and extend it to MPI_Alltoall. We analyze the performance of this algorithm using OMB benchmark suite. We demonstrate approximately 30% and 43% improvement in non-personalized and personalized communication benchmarks respectively when compared with the state-of-the-art MPI libraries on 128 GPUs","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116024451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote 2: P Sadayappan","authors":"","doi":"10.1109/hipc56025.2022.00011","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00011","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128236302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote 1: Paolo Lenne","authors":"","doi":"10.1109/hipc56025.2022.00010","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00010","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122369722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiPC 2022 Organization","authors":"","doi":"10.1109/hipc56025.2022.00007","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00007","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129039570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jason Yik, S. Kuppannagari, Hanqing Zeng, V. Prasanna
{"title":"Input Feature Pruning for Accelerating GNN Inference on Heterogeneous Platforms","authors":"Jason Yik, S. Kuppannagari, Hanqing Zeng, V. Prasanna","doi":"10.1109/HiPC56025.2022.00045","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00045","url":null,"abstract":"Graph Neural Networks (GNNs) are an emerging class of machine learning models which utilize structured graph information and node features to reduce high-dimensional input data to low-dimensional embeddings, from which predictions can be made. Due to the compounding effect of aggregating neighbor information, GNN inferences require raw data from many times more nodes than are targeted for prediction. Thus, on heterogeneous compute platforms, inference latency can be largely subject to the inter-device communication cost of transferring input feature data to the GPU/accelerator before computation has even begun. In this paper, we analyze the trade-off effect of pruning input features from GNN models, reducing the volume of raw data that the model works with to lower communication latency at the expense of an expected decrease in the overall model accuracy. We develop greedy and regression-based algorithms to determine which features to retain for optimal prediction accuracy. We evaluate pruned model variants and find that they can reduce inference latency by up to 80% with an accuracy loss of less than 5% compared to non-pruned models. Furthermore, we show that the latency reductions from input feature pruning can be extended under different system variables such as batch size and floating point precision.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133407326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ryan Kirkpatrick, Christopher Brown, Vladimir Janjic
{"title":"COMPROF and COMPLACE: Shared-Memory Communication Profiling and Automated Thread Placement via Dynamic Binary Instrumentation","authors":"Ryan Kirkpatrick, Christopher Brown, Vladimir Janjic","doi":"10.1109/HiPC56025.2022.00040","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00040","url":null,"abstract":"This paper presents COMPROF and COMPLACE, a novel profiling tool and thread placement technique for shared-memory architectures that requires no recompilation or user intervention. We use dynamic binary instrumentation to intercept memory operations and estimate inter-thread communication overhead, deriving (and possibly visualising) a communication graph of data-sharing between threads. We then use this graph to map threads to cores in order to optimise memory traffic through the memory system. Different paths through a system’s memory hierarchy have different latency, throughput and energy properties, COMPLACE exploits this heterogeneity to provide automatic performance and energy improvements for multithreaded programs. We demonstrate COMPLACE on the NAS Parallel Benchmark (NPB) suite where, using our technique, we are able to achieve improvements of up to 12% in the execution time and up to 10% in the energy consumption (compared to default Linux scheduling) while not requiring any modification or recompilation of the application code.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115121152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qinghua Zhou, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda
{"title":"Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads","authors":"Qinghua Zhou, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda","doi":"10.1109/HiPC56025.2022.00016","DOIUrl":"https://doi.org/10.1109/HiPC56025.2022.00016","url":null,"abstract":"With the rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes to run distributed training. Large message communication of GPU data between the GPUs is becoming a performance bottleneck in the overall training performance. GPU-Aware MPI libraries are widely adopted for state-of-the-art DL frameworks to improve communication performance. In the existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often utilized to sync up the updated model parameters among all the GPUs. However, for state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data turns to overburden the training performance due to the limited bandwidth of interconnect between the GPU nodes. On the other hand, the recent research on using GPU-based compression libraries to lower the pressure on the nearly saturated interconnection and co-designing online compression with the communication pattern provides a new perspective to optimize the performance of broadcast on modern GPU clusters.In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large message broadcast communication. The proposed design is evaluated to show benefits at both microbenchmark and application levels. At the microbenchmark level, the proposed design can reduce the broadcast communication latency by up to 80.9% compared to the baseline using a state-of-the-art MPI library and 55.1% compared to the existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, the proposed design reduces the training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while keeping similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"11 8","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120972327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote 4: Per Stenstr̈m","authors":"","doi":"10.1109/hipc56025.2022.00013","DOIUrl":"https://doi.org/10.1109/hipc56025.2022.00013","url":null,"abstract":"","PeriodicalId":119363,"journal":{"name":"2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116341100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}