2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC): Latest Publications

IRIS-BLAS: Towards a Performance Portable and Heterogeneous BLAS Library
Narasinga Rao Miniskar, Mohammad Alaul Haque Monil, Pedro Valero-Lara, Frank Liu, J. Vetter
DOI: 10.1109/HiPC56025.2022.00042 (December 2022)
Abstract: This paper presents IRIS-BLAS, a novel heterogeneous and performance-portable BLAS library. IRIS-BLAS is built on top of the IRIS runtime and multiple vendor and open-source BLAS libraries. It can transparently use all the architectures/devices available in a heterogeneous system, selecting the appropriate BLAS library based on the task mapping at run time. IRIS-BLAS is thus portable across a broad spectrum of architectures and BLAS libraries, freeing application developers from modifying their application source code. Although the emphasis is on portability, IRIS-BLAS provides performance competitive with or better than other state-of-the-art references. Moreover, IRIS-BLAS offers new capabilities, such as efficiently using extremely heterogeneous systems composed of multiple GPUs from different hardware vendors.
Citations: 3
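
The abstract does not expose IRIS-BLAS's API, so the following is only a minimal sketch of the run-time dispatch idea: the same GEMM task is routed to whichever BLAS backend matches the device the runtime mapped it onto. All names (`Device`, `BACKENDS`, `gemm`) are hypothetical.

```python
# Minimal sketch of run-time BLAS dispatch in the spirit of IRIS-BLAS.
# All names (Device, BACKENDS, gemm) are hypothetical; the real library
# builds on the IRIS runtime and vendor libraries such as cuBLAS.
from dataclasses import dataclass

import numpy as np

@dataclass
class Device:
    kind: str                     # e.g., "cpu", "nvidia-gpu", "amd-gpu"

def openblas_gemm(a, b):
    # CPU path: NumPy itself dispatches to an optimized BLAS.
    return a @ b

def cublas_gemm(a, b):
    # Placeholder for a cuBLAS call issued through the runtime.
    raise NotImplementedError("would call cuBLAS on an NVIDIA device")

# Task mapping decided at run time: device kind -> BLAS backend.
BACKENDS = {"cpu": openblas_gemm, "nvidia-gpu": cublas_gemm}

def gemm(a, b, device: Device):
    # The caller issues one portable GEMM; the mapping picks the library.
    return BACKENDS[device.kind](a, b)

a, b = np.ones((4, 8)), np.ones((8, 2))
print(gemm(a, b, Device("cpu")).shape)   # (4, 2)
```
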
Parallel Vertex Color Update on Large Dynamic Networks
A. Khanda, S. Bhowmick, Xin Liang, Sajal K. Das
DOI: 10.1109/HiPC56025.2022.00027 (December 2022)
Abstract: We present the first GPU-based parallel algorithm to efficiently update vertex coloring on large dynamic networks. For a single GPU, we introduce the concept of a loosely maintained vertex color update that reduces computation and memory requirements. For multiple GPUs in distributed environments, we propose a priority-based ordering of vertices to reduce communication time. We prove the correctness of our algorithms and experimentally demonstrate that, for graphs of over 16 million vertices and over 134 million edges on a single GPU, our dynamic algorithm is as much as 20x faster than a state-of-the-art algorithm on static graphs. For larger graphs with over 130 million vertices and over 260 million edges, our distributed implementation with 8 GPUs produces updated color assignments within 160 milliseconds. In all cases, the proposed parallel algorithms produce comparable or fewer colors than state-of-the-art algorithms.
Citations: 1
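
As a rough illustration of the incremental idea (not the authors' GPU algorithm), here is a serial sketch: after inserting edges, only endpoints whose colors now conflict are recolored, so work stays proportional to the affected region rather than the whole graph. The graph representation and helper names are illustrative.

```python
# Serial sketch of incremental recoloring (the paper's contribution is
# a GPU-parallel version).  `adj` is an adjacency dict of sets; `color`
# maps each vertex to an integer color.
def smallest_free_color(adj, color, v):
    used = {color[u] for u in adj[v]}
    c = 0
    while c in used:
        c += 1
    return c

def update_colors(adj, color, new_edges):
    # Insert the edges, then repair only endpoints whose colors now
    # conflict; this loose update touches the affected region instead
    # of recoloring the whole graph.
    conflicted = set()
    for u, v in new_edges:
        adj[u].add(v)
        adj[v].add(u)
        if color[u] == color[v]:
            conflicted.add(max(u, v))       # deterministic tie-break
    for v in conflicted:
        color[v] = smallest_free_color(adj, color, v)

adj = {0: {1}, 1: {0}, 2: set()}
color = {0: 0, 1: 1, 2: 0}
update_colors(adj, color, [(0, 2)])
print(color)   # vertex 2 moves off color 0
```
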
AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters
Nawras Alnaasan, Arpan Jain, A. Shafi, H. Subramoni, D. Panda
DOI: 10.1109/HiPC56025.2022.00017 (December 2022)
Abstract: Deep Learning (DL) has become a prominent machine learning technique due to the availability of efficient computational resources in the form of Graphics Processing Units (GPUs), large-scale datasets, and a variety of models. Newer generations of GPUs are designed with special emphasis on optimizing performance for DL applications, and easy-to-use DL frameworks such as PyTorch and TensorFlow have made domain experts from diverse fields more productive in building custom DL applications. However, existing Deep Neural Network (DNN) training approaches may not fully utilize newly emerging powerful GPUs like the NVIDIA A100; this is the primary issue we address in this paper. Our motivating analyses show that GPU utilization on the NVIDIA A100 can be as low as 43% using traditional DNN training approaches for small-to-medium DL models and input data sizes. This paper proposes AccDP, a data-parallel distributed DNN training approach, to accelerate GPU-based DL applications. AccDP couples the Message Passing Interface (MPI) communication library with NVIDIA's Multi-Process Service (MPS) to increase the amount of work assigned to parallel GPUs, resulting in higher utilization of compute resources. We evaluate our proposed design on different small-to-medium DL models and input sizes on state-of-the-art HPC clusters. By injecting more parallelism into DNN training with our approach, the evaluation shows up to 58% improvement in training performance on a single GPU and up to 62% on 16 GPUs compared to regular DNN training. Furthermore, we conduct an in-depth characterization of several DNN training factors and best practices, including the batch size and the number of data-loading workers, to optimally utilize GPU devices. To the best of our knowledge, this is the first work that explores the use of MPS and MPI to maximize the utilization of GPUs in distributed DNN training.
Citations: 0
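
A minimal sketch of the general pattern the paper builds on, MPI-based data parallelism in which several ranks drive the same GPU, follows. It uses mpi4py and PyTorch; the model, launch line, and sharding are illustrative, and the MPS control daemon (which lets the ranks' kernels share each GPU) is assumed to be running. This is not the authors' AccDP code.

```python
# Sketch of MPI data-parallel training with ranks sharing GPUs via MPS
# (not the authors' AccDP code).  Launch with something like:
#   mpirun -np 4 python train.py
# with the CUDA MPS daemon running so ranks' kernels overlap on each
# GPU instead of being time-sliced.
from mpi4py import MPI
import torch
import torch.nn.functional as F

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
device = torch.device("cuda", rank % torch.cuda.device_count())

model = torch.nn.Linear(32, 4).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    # Each rank trains on its own shard of the (synthetic) batch.
    x = torch.randn(64, 32, device=device)
    y = torch.randint(0, 4, (64,), device=device)
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    # Average gradients across ranks to keep the replicas in sync.
    for p in model.parameters():
        avg = comm.allreduce(p.grad.cpu().numpy(), op=MPI.SUM) / size
        p.grad.copy_(torch.from_numpy(avg).to(device))
    opt.step()
```
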
Provenance-based Workflow Diagnostics Using Program Specification
Yuta Nakamura, T. Malik, Iyad A. Kanj, Ashish Gehani
DOI: 10.1109/HiPC56025.2022.00046 (December 2022)
Abstract: Workflow management systems (WMSes) help automate and coordinate scientific modules and monitor their execution. WMSes are also used to repeat a workflow application with different inputs to test the sensitivity and reproducibility of runs. However, when differences arise in outputs across runs, current WMSes do not audit sufficient provenance metadata to determine where the execution first differed, which increases diagnostic time and leads to poor-quality diagnostic results. In this paper, we use program specification to precisely determine the locations where workflow execution differs, and we use existing audited provenance to isolate the modules where execution differs. We show that using program specification incurs some increased storage overhead, due to mapping provenance data flows onto the program specification, but leads to better-quality diagnostics in terms of the number of differences found and their location, relative to comparing provenance metadata audited within current WMSes.
Citations: 0
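
To make the mechanism concrete, here is a toy sketch of specification-guided divergence isolation: the "program specification" is reduced to an ordered module list, and each run's provenance to per-module output digests. All names are illustrative.

```python
# Toy sketch of specification-guided divergence isolation.  The
# "program specification" is reduced to an ordered list of workflow
# modules, and each run's provenance to a digest of every module's
# outputs; all names are illustrative.
import hashlib

SPEC = ["ingest", "clean", "train", "report"]   # module order from the spec

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def first_divergence(prov_a: dict, prov_b: dict):
    # Walk modules in specification order; report the first module
    # whose recorded outputs differ between the two runs.
    for module in SPEC:
        if prov_a[module] != prov_b[module]:
            return module
    return None   # the runs agree everywhere

run1 = {m: digest(m.encode()) for m in SPEC}
run2 = dict(run1, train=digest(b"perturbed-output"))
print(first_divergence(run1, run2))   # -> "train"
```
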
memwalkd: Accelerating Key-value Stores using Page Table Walkers
R. S. Anupindi, Swaroop Kotni, Arkaprava Basu
DOI: 10.1109/HiPC56025.2022.00021 (December 2022)
Abstract: In-memory key-value stores (KVS) or caches form the backbone of many commercial and HPC applications. The basic operation of a KVS revolves around storing or updating the mapping from keys to their corresponding values and looking up that mapping when requested by a client. We observe that the memory management unit (MMU) in modern processors does something similar: it looks up the mapping between virtual addresses and physical addresses stored in the per-process page table. We leverage the MMU to gain hardware acceleration for key-value lookup for free in a new key-value store design called memwalkd. We hash keys to unique virtual addresses; these addresses map to the physical addresses that hold the corresponding values. Thus, GETs/SETs are performed by simply issuing loads/stores to the hash of a key. Across a wide range of workloads, memwalkd achieves 1.8× better throughput than a highly optimized implementation of memcached called MICA [1].
Citations: 0
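
The hardware trick itself cannot be reproduced in user space, but the addressing scheme can be emulated: derive an address from the key's hash and perform GET/SET as plain loads and stores into one large demand-paged region. In memwalkd, the MMU and page-table walker resolve such addresses in hardware; this sketch ignores hash collisions and uses fixed-size value slots for brevity.

```python
# User-space emulation of the addressing scheme: hash the key to an
# offset in one large anonymous (demand-paged) mapping and treat slice
# reads/writes as the loads/stores.  memwalkd itself arranges the
# mapping so the hardware page-table walker resolves the lookup; this
# sketch ignores collisions and uses fixed-size value slots.
import hashlib
import mmap

SLOT = 64                                  # bytes per value slot
NSLOTS = 1 << 20                           # ~1M slots
region = mmap.mmap(-1, NSLOTS * SLOT)      # anonymous memory region

def slot_offset(key: bytes) -> int:
    h = hashlib.blake2b(key, digest_size=8).digest()
    return (int.from_bytes(h, "little") % NSLOTS) * SLOT

def kv_set(key: bytes, value: bytes) -> None:
    off = slot_offset(key)
    region[off:off + SLOT] = value.ljust(SLOT, b"\0")   # the "store"

def kv_get(key: bytes) -> bytes:
    off = slot_offset(key)
    return region[off:off + SLOT].rstrip(b"\0")         # the "load"

kv_set(b"hello", b"world")
print(kv_get(b"hello"))   # b'world'
```
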
EECAAP: Efficient Edge-Computing based Anonymous Authentication Protocol for IoV
Himani Sikarwar, D. Das
DOI: 10.1109/HiPC56025.2022.00047 (December 2022)
Abstract: Traditional security solutions for the edge face challenges such as high power consumption and communication and computation overhead, and they are not feasible for highly dynamic, resource-constrained (memory and processing power) Internet of Vehicles (IoV) networks. Physical Unclonable Functions (PUFs) can address this issue. This paper presents a new PUF-based anonymous mutual authentication and key exchange protocol for the IoV communication environment that combines unique, unpredictable PUFs with one-way hash functions. The proposed protocol uses a three-layered infrastructure for IoV networks (a vehicle layer, an edge computing layer, and a cloud layer) to improve efficiency, throughput, and Quality of Service (QoS). The proposed hybrid system protects against cloning, side-channel, and physical attacks with low computation cost and negligible storage cost. Implementation and performance analysis show that the proposed protocol reduces computation and communication overhead by up to 80% and 60%, respectively, and also reduces time complexity.
Citations: 0
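
A toy challenge-response flow combining a PUF with one-way hashes is sketched below. A real PUF derives its response from device-unique physical variation; an HMAC over a stored secret merely stands in for it here, and the message flow is illustrative rather than the paper's exact protocol.

```python
# Toy PUF-based challenge-response authentication (illustrative only;
# not the paper's exact message flow).  A real PUF derives responses
# from device-unique physical variation; HMAC over a stored secret is
# only a software stand-in here.
import hashlib
import hmac
import os

class VehicleDevice:
    def __init__(self):
        self._puf_secret = os.urandom(32)   # stand-in for the physical PUF
    def puf(self, challenge: bytes) -> bytes:
        return hmac.new(self._puf_secret, challenge, hashlib.sha256).digest()

vehicle = VehicleDevice()

# Enrollment (once, in a secure setting): the edge server records a
# challenge/response pair (CRP) for the vehicle.
challenge = os.urandom(16)
stored_response = vehicle.puf(challenge)

# Authentication: the edge sends the challenge plus a fresh nonce; the
# vehicle proves possession of the PUF without revealing the raw
# response, since only a one-way hash crosses the network.
nonce = os.urandom(16)
proof = hashlib.sha256(vehicle.puf(challenge) + nonce).digest()
assert proof == hashlib.sha256(stored_response + nonce).digest()

# Both sides can then derive a shared session key from the same values.
session_key = hashlib.sha256(stored_response + nonce + b"session").digest()
print(session_key.hex()[:16])
```
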
HiPC 2022 Technical Program Committee
DOI: 10.1109/hipc56025.2022.00009 (December 2022)
Citations: 0
Scaling the SOO Global Blackbox Optimizer on a 128-core Architecture
David Redon, B. Derbel, P. Fortin
DOI: 10.1109/HiPC56025.2022.00037 (December 2022)
Abstract: Blackbox optimization refers to the situation where no analytical knowledge about the problem is available beforehand, which is the case in a number of application fields, e.g., multi-disciplinary design and simulation optimization. In this context, the Simultaneous Optimistic Optimization (SOO) algorithm is a deterministic tree-based global optimizer with theoretically provable performance guarantees under mild conditions. In this paper, we consider the efficient shared-memory parallelization of SOO on a high-end HPC architecture with dozens of CPU cores. We propose different strategies based on exposing the levels of parallelism underlying the SOO algorithm. We show that the naive approach, performing multiple evaluations of the blackbox function in parallel, does not scale with the number of cores. By contrast, we show that a parallel design based on the SOO-tree traversal provides substantial improvements in scalability and performance. We validate our strategies with a detailed performance analysis on a compute server with two 64-core processors, using a number of diverse benchmark functions with both increasing dimensions and numbers of cores.
Citations: 0
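
For readers unfamiliar with SOO, the following compact 1-D version shows where parallelism arises during the tree traversal: within one sweep, each expanded leaf spawns several independent blackbox evaluations that can run concurrently. This simplified sketch parallelizes only those evaluations; the paper's strategies for a 128-core machine are more elaborate.

```python
# Compact 1-D SOO with parallel evaluation of new cell centers; this
# simplified sketch parallelizes only the blackbox calls spawned by
# each expansion, one of several parallelism levels in the paper.
from concurrent.futures import ProcessPoolExecutor

def objective(x):
    return -(x - 0.3) ** 2            # toy blackbox; maximum at x = 0.3

def soo(f, lo=0.0, hi=1.0, rounds=20, workers=4):
    # leaves[d] holds the tree's depth-d cells as (lo, hi, f(center)).
    leaves = {0: [(lo, hi, f((lo + hi) / 2))]}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            vbest = float("-inf")
            for depth in sorted(leaves):
                if not leaves[depth]:
                    continue
                # Optimistic rule: expand the best cell at this depth
                # only if it beats everything expanded above it.
                cell = max(leaves[depth], key=lambda c: c[2])
                a, b, v = cell
                if v <= vbest:
                    continue
                vbest = v
                leaves[depth].remove(cell)
                thirds = [(a + (b - a) * i / 3, a + (b - a) * (i + 1) / 3)
                          for i in range(3)]
                # Independent blackbox evaluations run concurrently.
                vals = pool.map(f, [(l + r) / 2 for l, r in thirds])
                leaves.setdefault(depth + 1, []).extend(
                    (l, r, w) for (l, r), w in zip(thirds, vals))
    return max(v for cells in leaves.values() for _, _, v in cells)

if __name__ == "__main__":
    print(soo(objective))   # approaches 0, the optimum
```
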
Towards Efficient Cache Allocation for High-Frequency Checkpointing
Avinash Maurya, Bogdan Nicolae, M. M. Rafique, Amr M. Elsayed, T. Tonellot, F. Cappello
DOI: 10.1109/HiPC56025.2022.00043 (December 2022)
Abstract: While many HPC applications are known to have long runtimes, this is not always because of single large runs: in many cases, it is due to ensembles composed of many short runs (runtimes on the order of minutes). When each such run needs to checkpoint frequently (e.g., adjoint computations using a checkpoint interval on the order of milliseconds), it is important to minimize both the checkpointing overhead at each iteration and the initialization overhead. With the rising popularity of GPUs, minimizing both overheads simultaneously is challenging: while it is possible to take advantage of efficient asynchronous data transfers between GPU and host memory, this comes at the cost of the high initialization overhead needed to allocate and pin host memory. In this paper, we contribute an efficient technique to address this challenge. The key idea is an adaptive approach that delays the pinning of the host memory buffer holding the checkpoints until all memory pages are touched, which greatly reduces the overhead of registering the host memory with the CUDA driver. To this end, we combine asynchronous touching of memory pages with direct writes of checkpoints to untouched and touched memory pages, minimizing end-to-end checkpointing overhead based on performance modeling. Our evaluations show a significant improvement over a variety of alternative static allocation strategies and state-of-the-art approaches.
Citations: 2
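
A simplified sketch of the delayed-pinning idea follows, using CuPy's runtime bindings (hostRegister wraps cudaHostRegister). The paper's adaptive, performance-model-driven policy is reduced here to a fixed touch-then-register sequence, so this only illustrates why touching pages before registration pays off.

```python
# Touch-then-register sketch of delayed pinning with CuPy (the paper's
# adaptive, performance-model-driven scheme is simplified to two fixed
# phases).  cupy.cuda.runtime.hostRegister wraps cudaHostRegister.
import cupy as cp
import numpy as np

state = cp.random.rand(1 << 20)                  # device data to checkpoint
buf = np.empty(state.shape, dtype=state.dtype)   # pageable host buffer

# Phase 1: touch every page first (here with an initializing write).
# Registering already-resident pages with the CUDA driver is much
# cheaper than pinning cold, untouched pages up front.
buf.fill(0)

# Phase 2: register (pin) the touched buffer once, then issue the
# high-frequency checkpoints as asynchronous device-to-host copies.
cp.cuda.runtime.hostRegister(buf.ctypes.data, buf.nbytes, 0)
stream = cp.cuda.Stream(non_blocking=True)
for step in range(100):
    state.get(stream=stream, out=buf)            # async D2H into pinned buf
stream.synchronize()
cp.cuda.runtime.hostUnregister(buf.ctypes.data)
```
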
Joint Partitioning and Sampling Algorithm for Scaling Graph Neural Network
Manohar Lal Das, Vishwesh Jatala, Gagan Raj Gupta
DOI: 10.1109/HiPC56025.2022.00018 (December 2022)
Abstract: Graph Neural Networks (GNNs) have emerged as a popular toolbox for solving complex problems on graph data structures. Graph neural networks use machine learning techniques to learn vector representations of nodes and/or edges. Learning these representations demands a huge amount of memory and computing power. Traditional shared-memory multiprocessors are insufficient to meet the computing requirements of real-world data; hence, research has gained momentum toward distributed GNN training. Scaling distributed GNN training poses the following challenges: (1) the input graph needs to be efficiently partitioned, (2) the cost of communication between compute nodes should be reduced, and (3) the sampling strategy should be chosen to minimize the loss in accuracy. To address these challenges, we propose a joint partitioning and sampling algorithm that partitions the input graph with weighted METIS and uses a biased sampling strategy to minimize total communication cost. We implemented our approach using the DistDGL framework and evaluated it on several real-world datasets. We observe that, compared to the state-of-the-art DistDGL implementation, our approach (1) reduces communication overhead by 53% on average, (2) requires less time to partition a graph, (3) improves accuracy, and (4) achieves a speedup of 1.5x on the OGB-Arxiv dataset.
Citations: 0
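
The biased-sampling half of the idea can be sketched compactly: given a node partition (which the paper computes with weighted METIS), neighbors in the trainer's own partition are sampled with higher probability, trading a little sampling diversity for less cross-machine traffic. The bias factor and fanout are illustrative knobs, and sampling is with replacement for brevity.

```python
# Sketch of biased neighbor sampling over a given node partition (the
# paper computes the partition with weighted METIS; this is not the
# authors' code).
import random

def biased_sample(adj, part, v, fanout=2, local_bias=4.0):
    nbrs = adj[v]
    # Neighbors in v's own partition get `local_bias` times the weight
    # of remote neighbors, so sampled subgraphs stay mostly local.
    weights = [local_bias if part[u] == part[v] else 1.0 for u in nbrs]
    return random.choices(nbrs, weights=weights, k=min(fanout, len(nbrs)))

adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
part = {0: 0, 1: 0, 2: 1, 3: 1}        # e.g., METIS output
print(biased_sample(adj, part, 0))     # mostly samples node 1 (local)
```
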