Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming: Latest Publications

PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs
Chunyang Wang, Desen Sun, Yunru Bai
DOI: 10.1145/3572848.3577487 (published 2023-01-01)
Abstract: Dynamic Graph Neural Networks (DGNNs) have been widely applied in real-life applications such as link prediction and pandemic forecasting to capture both static structural information and temporal characteristics from dynamic graphs. Combining both time-dependent and time-independent components, DGNNs manifest substantial potential for parallel computation and data reuse, but suffer from severe memory access inefficiency and data transfer overhead under the canonical one-graph-at-a-time training pattern. To tackle these challenges, we propose PiPAD, a Pipelined and PArallel DGNN training framework for end-to-end performance optimization on GPUs. At both the algorithm and runtime levels, PiPAD holistically reconstructs the overall training paradigm, from data organization to computation. Capable of processing multiple graph snapshots in parallel, PiPAD eliminates unnecessary data transmission and alleviates memory access inefficiency to improve overall performance. Our evaluation across various datasets shows that PiPAD achieves 1.22×-9.57× speedups over state-of-the-art DGNN frameworks on three representative models.
Citations: 5
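
To make the snapshot-parallel idea concrete, here is a minimal numpy sketch (not PiPAD's actual GPU kernels, pipeline, or data layout; the sizes and the `gcn_layer` helper are illustrative assumptions) contrasting one-graph-at-a-time processing with stacking snapshots so the shared weights feed a single batched GEMM:

```python
# Minimal sketch of the one-graph-at-a-time vs. snapshot-parallel pattern.
import numpy as np

rng = np.random.default_rng(0)
N, F, T = 1024, 64, 4                                 # nodes, feature width, snapshots
A = [rng.random((N, N)) < 0.01 for _ in range(T)]     # per-snapshot adjacency (dense toy)
X = [rng.standard_normal((N, F)) for _ in range(T)]   # per-snapshot node features
W = rng.standard_normal((F, F))                       # weights shared across time

def gcn_layer(a, x, w):
    return np.maximum(a @ (x @ w), 0.0)               # A(XW) with ReLU

# Canonical pattern: one snapshot at a time, so W is re-read T times.
out_serial = [gcn_layer(a.astype(np.float64), x, W) for a, x in zip(A, X)]

# Snapshot-parallel pattern: stack features so XW is one batched GEMM and W is
# loaded once -- the kind of cross-snapshot data reuse PiPAD exploits on GPUs.
XW = np.stack(X) @ W                                  # shape (T, N, F), single batched GEMM
out_parallel = [np.maximum(a.astype(np.float64) @ xw, 0.0) for a, xw in zip(A, XW)]

assert all(np.allclose(s, p) for s, p in zip(out_serial, out_parallel))
```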
Practically and Theoretically Efficient Garbage Collection for Multiversioning
Yuanhao Wei, G. Blelloch, P. Fatourou, E. Ruppert
DOI: 10.1145/3572848.3577508 (published 2022-12-27)
Abstract: Multiversioning is widely used in databases, transactional memory, and concurrent data structures. It can be used to support read-only transactions that appear atomic in the presence of concurrent update operations. Any system that maintains multiple versions of each object needs a way of efficiently reclaiming them. We experimentally compare various existing reclamation techniques by applying them to a multiversion tree and a multiversion hash table. Using insights from these experiments, we develop two new multiversion garbage collection (MVGC) techniques, built on two novel concurrent version-list data structures. Our experimental evaluation shows that our fastest technique is competitive with the fastest existing MVGC techniques while using significantly less space on some workloads. Our new techniques also provide strong theoretical bounds, especially on space usage; these bounds ensure consistent performance, avoiding the very high worst-case space usage of other techniques.
Citations: 0
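
For readers new to the setting, the toy sequential version list below shows the invariant that multiversion garbage collection maintains: a version is reclaimable once it is no longer the newest version visible to any active read-only transaction. The paper's concurrent, lock-free version-list structures and space bounds are far more sophisticated; every name here is hypothetical:

```python
# Toy version list with timestamp-based reclamation (sequential sketch only).
from dataclasses import dataclass, field

@dataclass
class Version:
    value: object
    ts: int                       # commit timestamp of this version

@dataclass
class MVObject:
    versions: list = field(default_factory=list)   # sorted by ts, oldest first

    def write(self, value, ts):
        self.versions.append(Version(value, ts))

    def read(self, ts):
        # A read-only transaction at timestamp `ts` sees the newest version <= ts.
        return max((v for v in self.versions if v.ts <= ts), key=lambda v: v.ts)

    def collect(self, min_active_ts):
        # Keep the newest version with ts <= min_active_ts (still visible to the
        # oldest active reader) plus everything newer; older versions are garbage.
        keep_from = max(
            (i for i, v in enumerate(self.versions) if v.ts <= min_active_ts),
            default=0,
        )
        self.versions = self.versions[keep_from:]

obj = MVObject()
for ts, val in enumerate(["a", "b", "c", "d"]):
    obj.write(val, ts)
obj.collect(min_active_ts=2)      # oldest active reader runs at timestamp 2
assert [v.value for v in obj.versions] == ["c", "d"]
assert obj.read(2).value == "c"
```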
High-Performance Filters for GPUs
Hunter McCoy, S. Hofmeyr, K. Yelick, P. Pandey
DOI: 10.1145/3572848.3577507 (published 2022-12-18)
Abstract: Filters approximately store a set of items, trading accuracy for space efficiency, and can address the limited memory on accelerators such as GPUs. However, there is a lack of high-performance, feature-rich GPU filters, as most advancements in filter research have focused on CPUs. In this paper, we explore the design space of filters with the goal of developing massively parallel, high-performance, feature-rich filters for GPUs. We evaluate various filter designs in terms of performance, usability, and supported features, and identify two designs that offer the right trade-off among them. We present two new GPU-based filters, the TCF and the GQF, that can be employed in various high-performance data analytics applications. The TCF is a set-membership filter and supports faster inserts and queries, whereas the GQF supports counting, which comes at an additional performance cost. Both filters provide point and bulk insertion APIs and are designed to exploit the massive parallelism of the GPU without sacrificing usability and necessary features. The TCF and GQF are up to 4.4× and 1.4× faster, respectively, than previous GPU filters in our benchmarks, while overcoming fundamental performance and usability constraints of current GPU filters.
Citations: 1
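
As a stand-in for the filter contract the abstract refers to (no false negatives, tunable false-positive rate, point and bulk insertion), here is a minimal CPU-side Bloom-style sketch. The actual TCF and GQF use different, GPU-resident layouts, and the GQF additionally supports counting; this class and its names are purely illustrative:

```python
# Toy approximate-membership filter with point and bulk insertion APIs.
import hashlib

class ToyFilter:
    def __init__(self, bits=1 << 16, k=4):
        self.bits, self.k = bits, k
        self.table = bytearray(bits // 8)

    def _positions(self, item):
        # Derive k bit positions from one 128-bit hash.
        h = hashlib.blake2b(item.encode(), digest_size=16).digest()
        x = int.from_bytes(h, "little")
        for _ in range(self.k):
            yield x % self.bits
            x >>= 16

    def insert(self, item):                    # point insertion
        for p in self._positions(item):
            self.table[p // 8] |= 1 << (p % 8)

    def insert_bulk(self, items):              # bulk insertion (batched on a GPU)
        for item in items:
            self.insert(item)

    def query(self, item):                     # may return a false positive,
        return all(self.table[p // 8] & (1 << (p % 8))   # never a false negative
                   for p in self._positions(item))

f = ToyFilter()
f.insert_bulk(["gpu", "filter", "ppopp"])
assert f.query("gpu") and f.query("filter")
```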
Fast Parallel Exact Inference on Bayesian Networks
Jiantong Jiang, Zeyi Wen, A. Mansoor, A. Mian
DOI: 10.1145/3572848.3577476 (published 2022-12-08)
Abstract: Bayesian networks (BNs) are attractive because they are graphical and interpretable machine learning models. However, exact inference on BNs is time-consuming, especially for complex problems. To improve efficiency, we propose a fast BN exact inference solution named Fast-BNI for multi-core CPUs. Fast-BNI enhances the efficiency of exact inference through hybrid parallelism that tightly integrates coarse- and fine-grained parallelism. We also propose techniques to further simplify the bottleneck operations of BN exact inference. The Fast-BNI source code is freely available at https://github.com/jjiantong/FastBN.
Citations: 1
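
The hybrid-parallelism idea can be sketched in miniature, under assumed junction-tree terminology (Fast-BNI's actual scheduling and bottleneck simplifications live in the FastBN repository linked above): coarse-grained parallelism across independent cliques, fine-grained vectorized work inside each clique's potential table:

```python
# Sketch of coarse-grained (across cliques) plus fine-grained (within a
# potential table) parallelism; the clique tables here are random toys.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(1)
# Each "clique" holds a potential table over three binary variables.
cliques = [rng.random((2, 2, 2)) for _ in range(8)]

def marginalize(potential):
    # Fine-grained parallelism: numpy vectorizes this sum over table entries.
    return potential.sum(axis=(1, 2))

# Coarse-grained parallelism: independent cliques are processed concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    messages = list(pool.map(marginalize, cliques))

print([m.shape for m in messages])   # eight length-2 messages
```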
Unexpected Scaling in Path Copying Trees
I. Kokorin, Alexander Fedorov, Trevor Brown, V. Aksenov
DOI: 10.1145/3572848.3577512 (published 2022-12-01)
Abstract: Although a wide variety of handcrafted concurrent data structures have been proposed, there is considerable interest in universal approaches (Universal Constructions, or UCs) for building concurrent data structures. UCs (semi-)automatically convert a sequential data structure into a concurrent one. The simplest approach uses locks [3, 6] that protect a sequential data structure and allow only one process to access it at a time; however, the resulting data structure is blocking. Most work on UCs instead focuses on obtaining non-blocking progress guarantees such as obstruction-freedom, lock-freedom, or wait-freedom. Many non-blocking UCs have appeared. Key examples include the seminal wait-free UC [2] by Herlihy, a NUMA-aware UC [10] by Yi et al., and an efficient UC for large objects [1] by Fatourou et al.
Citations: 1
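
The sequential technique that gives path-copying trees their name is standard and easy to show: an insert into a persistent BST copies only the nodes on the root-to-leaf path and shares every untouched subtree. A minimal sketch follows (this is the building block, not the paper's concurrent construction):

```python
# Path copying in a persistent BST: O(depth) new nodes per insert, old
# versions remain valid, untouched subtrees are shared between versions.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(root, key):
    # Returns a new root; only the search path is copied.
    if root is None:
        return Node(key)
    if key < root.key:
        return Node(root.key, insert(root.left, key), root.right)
    if key > root.key:
        return Node(root.key, root.left, insert(root.right, key))
    return root

v1 = insert(insert(insert(None, 5), 3), 8)
v2 = insert(v1, 4)            # a new version of the tree
assert v1.left.right is None  # the old version is unchanged
assert v2.right is v1.right   # the untouched subtree is shared, not copied
```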
Fast and Scalable Channels in Kotlin Coroutines
N. Koval, Dan Alistarh, Roman Elizarov
DOI: 10.1145/3572848.3577481 (published 2022-11-09)
Abstract: Asynchronous programming has gained significant popularity over the last decade: support for this programming pattern is available in many popular languages via libraries and native language implementations, typically in the form of coroutines or the async/await construct. Instead of programming via shared memory, this concept assumes implicit synchronization through message passing. The key data structure enabling such communication is the rendezvous channel. Roughly, a rendezvous channel is a blocking queue of size zero, so the send(e) and receive() operations wait for each other, performing a rendezvous when they meet. To optimize the message-passing pattern, channels are usually equipped with a fixed-size buffer, so that send(e) does not suspend but instead places elements into the buffer until its capacity is exceeded. This primitive is known as a buffered channel. This paper presents a fast and scalable algorithm for both rendezvous and buffered channels. Similarly to modern queues, our solution is based on an infinite array with two positional counters for the send(e) and receive() operations, leveraging the unconditional Fetch-And-Add instruction to update them. Yet the algorithm requires non-trivial modifications of this classic pattern in order to support the full channel semantics, such as buffering and cancellation of waiting requests. We compare the performance of our solution to that of the Kotlin implementation, as well as against other academic proposals, showing up to 9.8× speedup. To showcase its expressiveness and performance, we also integrated the proposed algorithm into the standard Kotlin Coroutines library, replacing the previous channel implementations.
Citations: 4
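
The counter-per-operation pattern from the abstract can be sketched in a few lines: a cell array indexed by two monotone counters, where each send and each receive claims a cell and the pair meets in it. The toy below uses a Python lock in place of the Fetch-And-Add instruction and omits buffering, cancellation, and the lock-free cell protocol that make the real algorithm non-trivial:

```python
# Toy rendezvous channel: two monotone counters index a shared cell array.
import threading

class ToyRendezvousChannel:
    def __init__(self, capacity=1 << 10):
        # The paper uses a conceptually infinite array; we truncate it here.
        self.cells = [{"value": None,
                       "sender": threading.Event(),
                       "receiver": threading.Event()}
                      for _ in range(capacity)]
        self.senders = 0        # next send slot    (Fetch-And-Add in the paper)
        self.receivers = 0      # next receive slot (Fetch-And-Add in the paper)
        self.lock = threading.Lock()

    def send(self, value):
        with self.lock:                      # stands in for FAA on `senders`
            s = self.senders; self.senders += 1
        cell = self.cells[s]
        cell["value"] = value
        cell["sender"].set()                 # announce the element
        cell["receiver"].wait()              # rendezvous: wait for the receiver

    def receive(self):
        with self.lock:                      # stands in for FAA on `receivers`
            r = self.receivers; self.receivers += 1
        cell = self.cells[r]
        cell["sender"].wait()                # suspend until a sender arrives
        value = cell["value"]
        cell["receiver"].set()               # release the waiting sender
        return value

ch = ToyRendezvousChannel()
t = threading.Thread(target=lambda: ch.send("hello")); t.start()
print(ch.receive()); t.join()
```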
TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition
Lizhi Xiang, Miao Yin, Chengming Zhang, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan, Dingwen Tao
DOI: 10.1145/3572848.3577478 (published 2022-11-07)
Abstract: Tucker decomposition is one of the state-of-the-art CNN model compression techniques. However, unlike the FLOPs reduction it promises, we observe very limited inference-time reduction with Tucker-compressed models when using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions, along with analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation of five modern CNNs on an A100 GPU demonstrates that our compressed models with our optimized code achieve up to 2.21× speedup over cuDNN, 1.12× speedup over TVM, and 3.27× over the original models using cuDNN, with at most 0.05% accuracy loss.
Citations: 5
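
Back-of-the-envelope arithmetic shows why Tucker ranks, not raw channel widths, govern the compressed cost, and why TDC picks ranks by measured latency: FLOP ratios like the one below often fail to materialize as cuDNN speedup. All layer sizes and ranks here are assumed for illustration:

```python
# Tucker-2 turns a C_out x C_in x k x k convolution into a 1x1 conv into rank
# r_in, an r_out x r_in x k x k core conv, and a 1x1 conv back out to C_out.
def conv_flops(c_out, c_in, k, h, w):
    return 2 * c_out * c_in * k * k * h * w        # multiply-adds counted as 2

c_in, c_out, k, h, w = 256, 256, 3, 56, 56          # an assumed mid-network layer
r_in, r_out = 64, 64                                # assumed Tucker ranks

full = conv_flops(c_out, c_in, k, h, w)
tucker = (conv_flops(r_in, c_in, 1, h, w)           # input 1x1 projection
          + conv_flops(r_out, r_in, k, h, w)        # small k x k core conv
          + conv_flops(c_out, r_out, 1, h, w))      # output 1x1 projection
print(f"full: {full/1e9:.2f} GFLOPs, tucker: {tucker/1e9:.2f} GFLOPs, "
      f"ratio: {full/tucker:.1f}x")                 # ~8.5x fewer FLOPs
```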
Efficient All-Reduce for Distributed DNN Training in Optical Interconnect Systems
Fei Dai, Yawen Chen, Zhiyi Huang, Haibo Zhang, Fangfang Zhang
DOI: 10.1145/3572848.3577391 (published 2022-07-22)
Abstract: All-reduce is the crucial communication primitive for reducing model parameters in distributed Deep Neural Network (DNN) training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, which cannot meet the communication requirements of distributed training of large DNNs due to their low data bandwidth. A promising alternative to electrical interconnects is optical interconnects, which can provide high bandwidth, low transmission delay, and low power cost. We propose an efficient scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing the all-reduce operation in optical interconnect systems. WRHT takes advantage of WDM (Wavelength Division Multiplexing) to reduce the communication time of distributed data-parallel DNN training. Simulations using real DNN models show that, compared to all-reduce algorithms on electrical and optical network systems, our approach reduces communication time by 75.76% and 91.86%, respectively.
Citations: 2
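
For reference, a plain hierarchical-tree all-reduce on host memory is sketched below: gradients are combined up the tree, then the root's result is broadcast back down. WRHT's contribution is mapping such a tree onto optical WDM links so wavelengths are reused across tree levels, none of which is modeled in this toy:

```python
# Tree-based all-reduce over in-memory "ranks"; communication becomes copies.
import numpy as np

def tree_allreduce(grads):
    nodes = [g.copy() for g in grads]
    n = len(nodes)
    step = 1
    while step < n:                 # reduce phase: combine up the tree
        for i in range(0, n, 2 * step):
            if i + step < n:
                nodes[i] += nodes[i + step]
        step *= 2
    for i in range(1, n):           # broadcast phase: send the result back down
        nodes[i] = nodes[0].copy()
    return nodes

grads = [np.full(4, float(rank)) for rank in range(8)]
out = tree_allreduce(grads)
assert all(np.array_equal(v, np.full(4, 28.0)) for v in out)   # 0+1+...+7 = 28
```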
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs
William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, J. Doerfert, O. Zinenko
DOI: 10.1145/3572848.3577475 (published 2022-07-01)
Abstract: While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads), based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification, and that enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer, which makes use of transpiled CUDA PyTorch kernels, outperforms the PyTorch CPU native backend by 2.7×.
Citations: 3
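
The index-space mapping at the heart of GPU-to-CPU transpilation can be shown in miniature: the grid dimension of a CUDA-style kernel becomes a parallel loop over CPU threads, and the block dimension becomes a serial inner loop. Polygeist/MLIR does this on real CUDA at the IR level with parallelism-specific optimizations; the Python toy below only illustrates the mapping:

```python
# A SAXPY "kernel" over a (grid, block) index space, run on CPU threads.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def saxpy_kernel(block_idx, block_dim, a, x, y, out):
    # Corresponds to: i = blockIdx.x * blockDim.x + threadIdx.x
    for thread_idx in range(block_dim):          # former GPU threads: serial loop
        i = block_idx * block_dim + thread_idx
        if i < len(x):                           # the kernel's bounds guard
            out[i] = a * x[i] + y[i]

n, block_dim = 1000, 128
x, y, out = np.arange(n, dtype=float), np.ones(n), np.empty(n)
grid_dim = (n + block_dim - 1) // block_dim

with ThreadPoolExecutor() as pool:               # former thread blocks: parallel
    list(pool.map(lambda b: saxpy_kernel(b, block_dim, 2.0, x, y, out),
                  range(grid_dim)))

assert np.allclose(out, 2.0 * x + y)
```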
Lifetime-Based Optimization for Simulating Quantum Circuits on a New Sunway Supercomputer
Yaojian Chen, Yong Liu, X. Shi, Jiawei Song, Xin Liu, L. Gan, Chunyi Guo, H. Fu, Jie Gao, Dexun Chen, Guangwen Yang
DOI: 10.1145/3572848.3577529 (published 2022-05-01)
Abstract: High-performance classical simulation of quantum circuits, in particular via the tensor network contraction algorithm, has become an important tool for the validation of noisy quantum computing. To address memory limitations, the slicing technique is used to reduce tensor dimensions, but it can also introduce additional computation overhead that greatly slows down overall performance. This paper proposes novel lifetime-based methods to reduce the slicing overhead and improve computing efficiency, including an interpretation method for dealing with slicing overhead, an in-place slicing strategy to find the smallest slicing set, and an adaptive tensor network contraction path refiner customized for the Sunway architecture. Experiments show that in most cases the slicing overhead under our in-place slicing strategy is lower than that of Cotengra, currently the most widely used contraction-path optimization software. With these optimizations, the simulation time for the Sycamore quantum processor's random quantum circuit (RQC) is reduced to 96.1 s, with a sustained single-precision performance of 308.6 Pflops using over 41M cores to generate 1M correlated samples, more than a 5× improvement over the 60.4 Pflops achieved by the 2021 Gordon Bell Prize work.
Citations: 1
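
What "slicing" trades away can be seen in a two-tensor example: fixing a shared index shrinks the peak intermediate by the index size, at the cost of repeating the contraction once per slice; the paper's lifetime-based methods target exactly that repeated work. A numpy identity check with assumed toy shapes:

```python
# Slicing a shared index k: contract each k-slice separately, then sum.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 16, 4))    # indices (i, a, k); k has size 4
B = rng.standard_normal((4, 16, 8))    # indices (k, b, j)

# Unsliced contraction: the intermediate spans the full (i, a, b, j) space.
full = np.einsum("iak,kbj->iabj", A, B)

# Sliced contraction: one k-slice at a time, 4x smaller per-slice work set,
# but the outer-product work is repeated for every slice.
sliced = sum(np.einsum("ia,bj->iabj", A[:, :, k], B[k]) for k in range(A.shape[2]))

assert np.allclose(full, sliced)
```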