Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming最新文献_第2页

A fast work-efficient SSSP algorithm for GPUs 一种快速高效的gpu SSSP算法

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441605

Kai Wang, D. Fussell, Calvin Lin

{"title":"A fast work-efficient SSSP algorithm for GPUs","authors":"Kai Wang, D. Fussell, Calvin Lin","doi":"10.1145/3437801.3441605","DOIUrl":"https://doi.org/10.1145/3437801.3441605","url":null,"abstract":"This paper presents a new Single Source Shortest Path (SSSP) algorithm for GPUs. Our key advancement is an improved work scheduler, which is central to the performance of SSSP algorithms. Previous GPU solutions for SSSP use simple work schedulers that can be implemented efficiently on GPUs but that produce low quality schedules. Such solutions yield poor work efficiency and can underutilize the hardware due to a lack of parallelism. Our solution introduces a more sophisticated work scheduler---based on a novel highly parallel approximate priority queue---that produces high quality schedules while being efficiently implementable on GPUs. To evaluate our solution, we use 226 graph inputs from the Lonestar 4.0 benchmark suite and the SuiteSparse Matrix Collection, and we find that our solution outperforms the previous state-of-the-art solution by an average of 2.9×, showing that an efficient work scheduling mechanism can be implemented on GPUs without sacrificing schedule quality. While this paper focuses on the SSSP problem, it has broader implications for the use of GPUs, illustrating that seemingly ill-suited data structures, such as priority queues, can be efficiently implemented for GPUs if we use the proper software structure.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128262255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12

GPTune GPTune

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441621

Yang Liu, Wissam M. Sid-Lakhdar, O. Marques, Xinran Zhu, Chang Meng, J. Demmel, X. Li

引用次数: 30

Improving communication by optimizing on-node data movement with data layout 通过优化节点上的数据移动和数据布局来改善通信

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441598

Tuowen Zhao, Mary W. Hall, H. Johansen, Samuel Williams

{"title":"Improving communication by optimizing on-node data movement with data layout","authors":"Tuowen Zhao, Mary W. Hall, H. Johansen, Samuel Williams","doi":"10.1145/3437801.3441598","DOIUrl":"https://doi.org/10.1145/3437801.3441598","url":null,"abstract":"We present optimizations to improve communication performance by reducing on-node data movement for a class of distributed memory applications. The primary concept is to eliminate the data movement associated with packing and unpacking subsets of the data during communication. With the rapid rise in network injection bandwidth reducing off-node data movement cost, on-node data movement can be significantly more expensive than computation and network communication. This data movement is especially costly for small domains - as in memory-intensive multi-physics codes or when strong scaling to reduce time-to-solution. The optimizations presented include (1) optimizing data layout through indirection to enable pack-free communication; (2) creating contiguous views of memory using memory mapping thus minimizing the number of messages; and (3) applying these techniques to intra-node data movement including CPU-GPU data movement. The benefits of these optimizations are demonstrated in stencil benchmarks against a highly-optimized baseline, reducing communication time by up to 14.4×.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"307 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116799294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Constant-time snapshots with applications to concurrent data structures 应用程序的恒定时间快照到并发数据结构

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441602

Yuanhao Wei, N. Ben-David, G. Blelloch, P. Fatourou, E. Ruppert, Yihan Sun

{"title":"Constant-time snapshots with applications to concurrent data structures","authors":"Yuanhao Wei, N. Ben-David, G. Blelloch, P. Fatourou, E. Ruppert, Yihan Sun","doi":"10.1145/3437801.3441602","DOIUrl":"https://doi.org/10.1145/3437801.3441602","url":null,"abstract":"Given a concurrent data structure, we present an approach for efficiently taking snapshots of its constituent CAS objects. More specifically, we support a constant-time operation that returns a snapshot handle. This snapshot handle can later be used to read the value of any base object at the time the snapshot was taken. Reading an earlier version of a base object is wait-free and takes time proportional to the number of successful writes to the object since the snapshot was taken. Importantly, our approach preserves all the time bounds and parallelism of the original data structure. Our fast, flexible snapshots yield simple, efficient implementations of atomic multi-point queries on a large class of concurrent data structures. For example, in a search tree where child pointers are updated using CAS, once a snapshot is taken, one can atomically search for ranges of keys, find the first key that matches some criteria, or check if a collection of keys are all present, simply by running a standard sequential algorithm on a snapshot of the tree. To evaluate the performance of our approach, we apply it to three search trees, one balanced and two not. Experiments show that the overhead of supporting snapshots is low across a variety of workloads. Moreover, in almost all cases, range queries on the trees built from our snapshots perform as well as or better than state-of-the-art concurrent data structures that support atomic range queries.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"114 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124181918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

Compiler support for near data computing 编译器支持近数据计算

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441600

M. Kandemir, Jihyun Ryoo, Xulong Tang, Mustafa Karaköy

{"title":"Compiler support for near data computing","authors":"M. Kandemir, Jihyun Ryoo, Xulong Tang, Mustafa Karaköy","doi":"10.1145/3437801.3441600","DOIUrl":"https://doi.org/10.1145/3437801.3441600","url":null,"abstract":"Recent works from both hardware and software domains offer various optimizations that try to take advantage of near data computing (NDC) opportunities. While the results from these works indicate performance improvements of various magnitudes, the existing literature lacks a detailed quantification of the potential of NDC and analysis of compiler optimizations on tapping into that potential. This paper first presents an analysis of the NDC potential when executing multithreaded applications on manycore platforms. It then presents two compiler schemes designed to take advantage of NDC. The first of these schemes try to increase the amount of computation that can be performed in a hardware component, whereas the second compiler strategy strikes a balance between optimizing NDC and exploiting data reuse, by being more selective on when to perform NDC (even if the opportunity presents itself) and how. The collected experimental results on a 5×5 manycore system reveal that our first and second compiler schemes improve the overall performance of our multithreaded applications by, respectively, 22.5% and 25.2%, on average. Furthermore, these two compiler schemes are only 6.8% and 4.1% worse than an oracle scheme that makes the best near data computing decisions for each and every computation.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121884196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Understanding and bridging the gaps in current GNN performance optimizations 理解并弥合当前GNN性能优化中的差距

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441585

Kezhao Huang, Jidong Zhai, Zhen Zheng, Youngmin Yi, Xipeng Shen

引用次数: 47

Efficient algorithms for persistent transactional memory 持久事务性内存的高效算法

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441586

P. Ramalhete, Andreia Correia, P. Felber

{"title":"Efficient algorithms for persistent transactional memory","authors":"P. Ramalhete, Andreia Correia, P. Felber","doi":"10.1145/3437801.3441586","DOIUrl":"https://doi.org/10.1145/3437801.3441586","url":null,"abstract":"Durable techniques coupled with transactional semantics provide to application developers the guarantee that data is saved consistently in persistent memory (PM), even in the event of a non-corrupting failure. Persistence fences and flush instructions are known to have a significant impact on the throughput of persistent transactions. In this paper we explore different trade-offs in terms of memory usage vs. number of fences and flushes. We present two new algorithms, named Trinity and Quadra, for durable transactions on PM and implement each of them in the form of a user-level library persistent transactional memory (PTM). Quadra achieves the lower bound with respect to the number of persistence fences and executes one flush instruction per modified cache line. Trinity can be easily combined with concurrency control techniques based on fine grain locking, and we have integrated it with our TL2 adaptation, with eager locking and write-through update strategy. Moreover, the combination of Trinity and TL2 into a PTM provides good scalability for data structures and workloads with a disjoint access pattern. We used this disjoint PTM to implement a key-value (KV) store with durable linearizable transactions. When compared with previous work, our TL2 KV store provides better throughput in nearly all experiments.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131545546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

Dynamic scaling for low-precision learning 低精度学习的动态缩放

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441624

Ruobing Han, Min Si, J. Demmel, Yang You

{"title":"Dynamic scaling for low-precision learning","authors":"Ruobing Han, Min Si, J. Demmel, Yang You","doi":"10.1145/3437801.3441624","DOIUrl":"https://doi.org/10.1145/3437801.3441624","url":null,"abstract":"In recent years, distributed deep learning is becoming popular in industry and academia. Although researchers want to use distributed systems for training, it has been reported that the communication cost for synchronizing gradients can be a bottleneck. Using low-precision gradients is a promising technique for reducing the bandwidth requirement. In this work, we propose Auto Precision Scaling (APS), an algorithm that can improve the accuracy when we communicate gradients by low-precision floating-point values. APS can improve the accuracy for all precisions with a trivial communication cost. Our experimental results show that for both image classification and segmentation, applying APS can train the state-of-the-art models by 8-bit floating-point gradients with no or only a tiny accuracy loss (<0.05%). Furthermore, we can avoid any accuracy loss by designing a hybrid-precision technique. Finally, we propose a performance model to evaluate the proposed method. Our experimental results show that APS can get a significant speedup over the state-of-the-art method. To make it available to researchers and developers, we design and implement a high-performance system for customized precision Deep Learning(CPD), which can simulate the training process using an arbitrary low-precision customized floating-point format. We integrate CPD into PyTorch and make it open-source to the public1.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116062966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

FFT blitz: the tensor cores strike back FFT闪电战:张量核心反击

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441623

Sultan Durrani, Muhammad Saad Chughtai, Abdul Dakkak, Wen-mei W. Hwu, Lawrence Rauchwerger

引用次数: 6

DFOGraph DFOGraph

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2021-02-17 DOI: 10.1145/3437801.3441622

Jiping Yu, W. Qin, Xiaowei Zhu, Zhenbo Sun, Jianqiang Huang, Xiaohan Li, Wenguang Chen

引用次数: 2