{"title":"A fast work-efficient SSSP algorithm for GPUs","authors":"Kai Wang, D. Fussell, Calvin Lin","doi":"10.1145/3437801.3441605","DOIUrl":"https://doi.org/10.1145/3437801.3441605","url":null,"abstract":"This paper presents a new Single Source Shortest Path (SSSP) algorithm for GPUs. Our key advancement is an improved work scheduler, which is central to the performance of SSSP algorithms. Previous GPU solutions for SSSP use simple work schedulers that can be implemented efficiently on GPUs but that produce low quality schedules. Such solutions yield poor work efficiency and can underutilize the hardware due to a lack of parallelism. Our solution introduces a more sophisticated work scheduler---based on a novel highly parallel approximate priority queue---that produces high quality schedules while being efficiently implementable on GPUs. To evaluate our solution, we use 226 graph inputs from the Lonestar 4.0 benchmark suite and the SuiteSparse Matrix Collection, and we find that our solution outperforms the previous state-of-the-art solution by an average of 2.9×, showing that an efficient work scheduling mechanism can be implemented on GPUs without sacrificing schedule quality. 
While this paper focuses on the SSSP problem, it has broader implications for the use of GPUs, illustrating that seemingly ill-suited data structures, such as priority queues, can be efficiently implemented for GPUs if we use the proper software structure.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128262255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPTune","authors":"Yang Liu, Wissam M. Sid-Lakhdar, O. Marques, Xinran Zhu, Chang Meng, J. Demmel, X. Li","doi":"10.1145/3437801.3441621","DOIUrl":"https://doi.org/10.1145/3437801.3441621","url":null,"abstract":"Multitask learning has proven to be useful in the field of machine learning when additional knowledge is available to help a prediction task. We adapt this paradigm to develop autotuning frameworks, where the objective is to find the optimal performance parameters of an application code that is treated as a black-box function. Furthermore, we combine multitask learning with multi-objective tuning and incorporation of coarse performance models to enhance the tuning capability. The proposed framework is parallelized and applicable to any application, particularly exascale applications with a small number of function evaluations. Compared with other state-of-the-art single-task learning frameworks, the proposed framework attains up to 2.8X better code performance for at least 80% of all tasks using up to 2048 cores.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132045326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving communication by optimizing on-node data movement with data layout","authors":"Tuowen Zhao, Mary W. Hall, H. Johansen, Samuel Williams","doi":"10.1145/3437801.3441598","DOIUrl":"https://doi.org/10.1145/3437801.3441598","url":null,"abstract":"We present optimizations to improve communication performance by reducing on-node data movement for a class of distributed memory applications. The primary concept is to eliminate the data movement associated with packing and unpacking subsets of the data during communication. With the rapid rise in network injection bandwidth reducing off-node data movement cost, on-node data movement can be significantly more expensive than computation and network communication. This data movement is especially costly for small domains - as in memory-intensive multi-physics codes or when strong scaling to reduce time-to-solution. The optimizations presented include (1) optimizing data layout through indirection to enable pack-free communication; (2) creating contiguous views of memory using memory mapping thus minimizing the number of messages; and (3) applying these techniques to intra-node data movement including CPU-GPU data movement. 
The benefits of these optimizations are demonstrated in stencil benchmarks against a highly-optimized baseline, reducing communication time by up to 14.4×.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"307 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116799294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constant-time snapshots with applications to concurrent data structures","authors":"Yuanhao Wei, N. Ben-David, G. Blelloch, P. Fatourou, E. Ruppert, Yihan Sun","doi":"10.1145/3437801.3441602","DOIUrl":"https://doi.org/10.1145/3437801.3441602","url":null,"abstract":"Given a concurrent data structure, we present an approach for efficiently taking snapshots of its constituent CAS objects. More specifically, we support a constant-time operation that returns a snapshot handle. This snapshot handle can later be used to read the value of any base object at the time the snapshot was taken. Reading an earlier version of a base object is wait-free and takes time proportional to the number of successful writes to the object since the snapshot was taken. Importantly, our approach preserves all the time bounds and parallelism of the original data structure. Our fast, flexible snapshots yield simple, efficient implementations of atomic multi-point queries on a large class of concurrent data structures. For example, in a search tree where child pointers are updated using CAS, once a snapshot is taken, one can atomically search for ranges of keys, find the first key that matches some criteria, or check if a collection of keys are all present, simply by running a standard sequential algorithm on a snapshot of the tree. To evaluate the performance of our approach, we apply it to three search trees, one balanced and two not. Experiments show that the overhead of supporting snapshots is low across a variety of workloads. 
Moreover, in almost all cases, range queries on the trees built from our snapshots perform as well as or better than state-of-the-art concurrent data structures that support atomic range queries.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"114 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124181918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler support for near data computing","authors":"M. Kandemir, Jihyun Ryoo, Xulong Tang, Mustafa Karaköy","doi":"10.1145/3437801.3441600","DOIUrl":"https://doi.org/10.1145/3437801.3441600","url":null,"abstract":"Recent works from both hardware and software domains offer various optimizations that try to take advantage of near data computing (NDC) opportunities. While the results from these works indicate performance improvements of various magnitudes, the existing literature lacks a detailed quantification of the potential of NDC and analysis of compiler optimizations on tapping into that potential. This paper first presents an analysis of the NDC potential when executing multithreaded applications on manycore platforms. It then presents two compiler schemes designed to take advantage of NDC. The first of these schemes tries to increase the amount of computation that can be performed in a hardware component, whereas the second compiler strategy strikes a balance between optimizing NDC and exploiting data reuse, by being more selective on when to perform NDC (even if the opportunity presents itself) and how. The collected experimental results on a 5×5 manycore system reveal that our first and second compiler schemes improve the overall performance of our multithreaded applications by, respectively, 22.5% and 25.2%, on average. 
Furthermore, these two compiler schemes are only 6.8% and 4.1% worse than an oracle scheme that makes the best near data computing decisions for each and every computation.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121884196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding and bridging the gaps in current GNN performance optimizations","authors":"Kezhao Huang, Jidong Zhai, Zhen Zheng, Youngmin Yi, Xipeng Shen","doi":"10.1145/3437801.3441585","DOIUrl":"https://doi.org/10.1145/3437801.3441585","url":null,"abstract":"Graph Neural Network (GNN) has recently drawn a rapid increase of interest in many domains for its effectiveness in learning over graphs. Maximizing its performance is essential for many tasks, but remains preliminarily understood. In this work, we provide an in-depth examination of the state-of-the-art GNN frameworks, revealing five major gaps in the current frameworks in optimizing GNN performance, especially in handling the special complexities of GNN over traditional graph or DNN operations. Based on the insights, we put together a set of optimizations to fill the gaps. These optimizations leverage the state-of-the-art GPU optimization techniques and tailor them to the special properties of GNN. Experimental results show that these optimizations achieve 1.37×--15.5× performance improvement over the state-of-the-art frameworks on various GNN models.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117192070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient algorithms for persistent transactional memory","authors":"P. Ramalhete, Andreia Correia, P. Felber","doi":"10.1145/3437801.3441586","DOIUrl":"https://doi.org/10.1145/3437801.3441586","url":null,"abstract":"Durable techniques coupled with transactional semantics provide to application developers the guarantee that data is saved consistently in persistent memory (PM), even in the event of a non-corrupting failure. Persistence fences and flush instructions are known to have a significant impact on the throughput of persistent transactions. In this paper we explore different trade-offs in terms of memory usage vs. number of fences and flushes. We present two new algorithms, named Trinity and Quadra, for durable transactions on PM and implement each of them in the form of a user-level library persistent transactional memory (PTM). Quadra achieves the lower bound with respect to the number of persistence fences and executes one flush instruction per modified cache line. Trinity can be easily combined with concurrency control techniques based on fine grain locking, and we have integrated it with our TL2 adaptation, with eager locking and write-through update strategy. Moreover, the combination of Trinity and TL2 into a PTM provides good scalability for data structures and workloads with a disjoint access pattern. We used this disjoint PTM to implement a key-value (KV) store with durable linearizable transactions. 
When compared with previous work, our TL2 KV store provides better throughput in nearly all experiments.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131545546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic scaling for low-precision learning","authors":"Ruobing Han, Min Si, J. Demmel, Yang You","doi":"10.1145/3437801.3441624","DOIUrl":"https://doi.org/10.1145/3437801.3441624","url":null,"abstract":"In recent years, distributed deep learning has become popular in industry and academia. Although researchers want to use distributed systems for training, it has been reported that the communication cost for synchronizing gradients can be a bottleneck. Using low-precision gradients is a promising technique for reducing the bandwidth requirement. In this work, we propose Auto Precision Scaling (APS), an algorithm that can improve the accuracy when we communicate gradients using low-precision floating-point values. APS can improve the accuracy for all precisions with a trivial communication cost. Our experimental results show that for both image classification and segmentation, applying APS can train the state-of-the-art models with 8-bit floating-point gradients with no or only a tiny accuracy loss (<0.05%). Furthermore, we can avoid any accuracy loss by designing a hybrid-precision technique. Finally, we propose a performance model to evaluate the proposed method. Our experimental results show that APS can get a significant speedup over the state-of-the-art method. To make it available to researchers and developers, we design and implement a high-performance system for customized precision Deep Learning (CPD), which can simulate the training process using an arbitrary low-precision customized floating-point format. 
We integrate CPD into PyTorch and make it open-source to the public.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116062966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FFT blitz: the tensor cores strike back","authors":"Sultan Durrani, Muhammad Saad Chughtai, Abdul Dakkak, Wen-mei W. Hwu, Lawrence Rauchwerger","doi":"10.1145/3437801.3441623","DOIUrl":"https://doi.org/10.1145/3437801.3441623","url":null,"abstract":"The Fast Fourier Transform (FFT), a reduced-complexity formulation of the Discrete Fourier Transform (DFT), is an important tool in many areas of science and engineering. FFTW is a well-known package that follows this approach and is currently one of the fastest available implementations of the FFT. NVIDIA introduced its version of FFTW called cuFFT that achieves high performance on GPUs. In this work we present a novel way to map the FFT algorithm onto the newly introduced Tensor Cores by adapting the Cooley-Tukey recursive FFT algorithm. We present four major types of optimizations that enhance the performance of our approach for varying FFT sizes and show that the approach consistently outperforms cuFFT with a speedup of about 15% to 250% on average.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"62 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116361000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DFOGraph","authors":"Jiping Yu, W. Qin, Xiaowei Zhu, Zhenbo Sun, Jianqiang Huang, Xiaohan Li, Wenguang Chen","doi":"10.1145/3437801.3441622","DOIUrl":"https://doi.org/10.1145/3437801.3441622","url":null,"abstract":"With the magnitude of graph-structured data continually increasing, graph processing systems that can scale-out and scale-up are needed to handle extreme-scale datasets. While existing distributed out-of-core solutions have made it possible, they suffer from limited performance due to excessive I/O and communication costs. We present DFOGraph, a distributed fully-out-of-core graph processing system that applies and assembles multiple techniques to enable I/O- and communication-efficient processing. DFOGraph builds upon two-level partitions with adaptive compressed representations to allow fine-grained selective computation and communication. Our evaluation shows DFOGraph outperforms Chaos and HybridGraph significantly (>12.94× and >10.82×) when scaling out to eight nodes.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123149492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}