Proceedings of the 34th ACM International Conference on Supercomputing: Latest Publications

BurstZ
Pub Date: 2020-06-29 | DOI: 10.1145/3392717.3392746
Gongjin Sun, Seongyoung Kang, S. Jun
Abstract: We present BurstZ, a bandwidth-efficient accelerator platform for scientific computing. While accelerators such as GPUs and FPGAs provide enormous computing capabilities, their effectiveness quickly deteriorates once the working set becomes larger than the on-board memory capacity, causing performance to become bottlenecked by the communication bandwidth between the host and the accelerator. Compression has not been very useful in solving this issue due to the difficulty of efficiently compressing floating point numbers, which scientific data often consists of. Most compression algorithms are either ineffective with floating point numbers or have a high performance overhead. BurstZ is an FPGA-based accelerator platform which addresses the bandwidth issue via a novel hardware-optimized floating point compression algorithm, which we call sZFP. We demonstrate that BurstZ can completely remove the communication bottleneck for accelerators, using a 3D stencil-code accelerator implemented on a prototype BurstZ implementation. Evaluated against hand-optimized implementations of stencil code accelerators of the same architecture, our BurstZ prototype outperformed an accelerator without compression by almost 4X, and even an accelerator with enough memory for the entire dataset by over 2X. BurstZ improved communication efficiency so much that our prototype was even able to outperform the projected upper-limit performance of an optimized stencil core with ideal memory access characteristics by over 2X.
Citations: 10
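The headline numbers follow from simple bandwidth arithmetic: if the accelerator streams its working set over the host link, a compressor that shrinks the stream by a factor R raises the effective link bandwidth, and hence the throughput ceiling of a bandwidth-bound kernel, by up to R. A back-of-the-envelope sketch of that model (the link speed and compression ratio below are illustrative, not numbers from the paper):

```python
def effective_bandwidth(link_gbs: float, compression_ratio: float) -> float:
    """Effective data bandwidth seen by the kernel when the stream is
    compressed by `compression_ratio` before crossing the host link."""
    return link_gbs * compression_ratio

def speedup_if_bandwidth_bound(compression_ratio: float) -> float:
    # A stencil kernel that streams its working set is limited by the link,
    # so compression raises the throughput ceiling proportionally.
    return compression_ratio

link = 16.0   # e.g. a PCIe 3.0 x16 link, roughly 16 GB/s (illustrative)
ratio = 4.0   # hypothetical sZFP-style compression ratio
print(effective_bandwidth(link, ratio))  # 64.0
```

At a hypothetical 4:1 ratio, a 16 GB/s link behaves like a 64 GB/s one, which is consistent in spirit with the near-4X gain the abstract reports for the bandwidth-bound case.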
Fuzzy fairness controller for NVMe SSDs
Pub Date: 2020-06-29 | DOI: 10.1145/3392717.3392766
S. Tripathy, Debiprasanna Sahoo, M. Satpathy, M. Mutyam
Abstract: Modern NVMe SSDs are widely deployed in diverse domains due to characteristics like high performance, robustness, and energy efficiency. It has been observed that the impact of interference among concurrently running workloads on their overall response time differs significantly in these devices, which leads to unfairness. Workload intensity is a dominant factor influencing the interference. Prior works use a threshold value to characterize a workload as high-intensity or low-intensity; this type of characterization has drawbacks due to the lack of information about the degree of low or high intensity. A data cache in an SSD controller, usually based on DRAM, plays a crucial role in improving device throughput and lifetime. However, the degree of parallelism is limited at this level compared to the SSD back-end, which consists of several channels, chips, and planes. Therefore, the impact of interference can be more pronounced at the data cache level. To the best of our knowledge, no prior work has addressed the fairness issue at the data cache level. In this work, we address this issue by proposing a fuzzy logic-based fairness control mechanism. A fuzzy fairness controller characterizes the degree of flow intensity (i.e., the rate at which requests are generated) of a workload and assigns priorities to the workloads. We implement the proposed mechanism in the MQSim framework and observe that our technique improves the fairness, weighted speedup, and harmonic speedup of the SSD by 29.84%, 11.24%, and 24.90% on average over the state of the art, respectively. The peak gains in fairness, weighted speedup, and harmonic speedup are 2.02X, 29.44%, and 56.30%, respectively.
Citations: 11
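The key departure from threshold-based characterization is that fuzzy membership assigns a workload a degree of intensity rather than a binary high/low label. A toy sketch of that idea (the IOPS breakpoints are invented for illustration; the paper's actual membership functions and rule base are not reproduced here):

```python
def flow_intensity_degrees(iops: float) -> tuple[float, float]:
    """Degrees of membership in the 'low' and 'high' intensity classes.
    Unlike a hard threshold, a mid-range workload can be partly both,
    so the controller can grade its priority smoothly.
    Breakpoints (20k and 40k IOPS) are illustrative, not from the paper."""
    low = max(0.0, min(1.0, (40_000 - iops) / 40_000))
    high = max(0.0, min(1.0, (iops - 20_000) / 40_000))
    return low, high

# A 30k-IOPS workload is partially low AND partially high intensity,
# information a single threshold would throw away.
low, high = flow_intensity_degrees(30_000)
print(low, high)  # 0.25 0.25
```

A fuzzy rule base would then combine such degrees (e.g. "the more high-intensity a flow, the lower its cache priority") and defuzzify them into a concrete scheduling weight.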
Efficient parallel algorithms for betweenness- and closeness-centrality in dynamic graphs
Pub Date: 2020-06-29 | DOI: 10.1145/3392717.3392743
Kshitij Shukla, Sai Charan Regunta, Sai Harsh Tondomker, Kishore Kothapalli
Abstract: Finding the centrality measures of nodes in a graph is a problem of fundamental importance due to various applications in social networks, biological networks, and transportation networks. Given the large size of such graphs, it is natural to use parallelism as a recourse. Several studies have shown how to compute the various centrality measures of nodes in a graph on parallel architectures, including multi-core systems and GPUs. However, as these graphs evolve and change, it is pertinent to study how to update the centrality measures on changes to the underlying graph. In this paper, we show novel parallel algorithms for updating the betweenness- and closeness-centrality values of nodes in a dynamic graph. Our algorithms process a batch of updates in parallel by extending the approaches for handling a single update to betweenness- and closeness-centrality by Jamour et al. [16] and Sariyüce et al. [27], respectively. In addition, our algorithms incorporate mechanisms to exploit the structural properties of graphs for enhanced performance. We implement our algorithms on two parallel architectures: an Intel 24-core CPU and an Nvidia Tesla V100 GPU. To the best of our knowledge, we are the first to show GPU algorithms for the above two problems. We conduct detailed experiments to study the impact of the various parameters associated with our algorithms and their implementation. Our results on a collection of real-world graphs indicate that our algorithms achieve a significant speedup over corresponding state-of-the-art algorithms.
Citations: 9
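For context, the static closeness centrality of one source is a single BFS plus a normalization, which is why per-source work parallelizes so naturally; incremental algorithms then avoid redoing this for sources whose distances an edge update cannot change. A minimal static baseline in plain Python (this is the quantity being maintained, not the paper's update algorithm):

```python
from collections import deque

def closeness(graph: dict, s) -> float:
    """Closeness centrality of node s in an unweighted graph given as an
    adjacency dict: (reachable nodes - 1) / sum of BFS distances from s."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Path graph 0-1-2-3: node 1 has distances 1, 1, 2 -> closeness 3/4.
g = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(closeness(g, 1))  # 0.75
```

A dynamic algorithm maintains these values under edge insertions/deletions by recomputing only the sources whose BFS trees the change can actually affect, rather than rerunning every BFS.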
AutoParBench
Pub Date: 2020-06-29 | DOI: 10.1145/3392717.3392744
G. Mendonca, C. Liao, Fernando Magno Quintão Pereira
Abstract: This paper describes AutoParBench, a framework to test OpenMP-based automatic parallelization tools. The core idea of this framework is a common representation, called a "JSON snapshot", that normalizes the output produced by auto-parallelizers. By automatically converting this output to the common representation, AutoParBench lets us compare auto-parallelizers among themselves, and compare them semantically against a reference collection. Currently, this reference collection consists of 99 programs with 1,579 loops. AutoParBench produces graphic or quantitative reports that lead to fast bug discovery. By investigating differences in snapshots produced by separate sources, i.e., tool-vs-tool or tool-vs-reference, we have discovered 3 unique bugs in ICC, 2 in DawnCC, 4 in AutoPar, and 2 in Cetus. These bugs have been acknowledged, and at least one of them was repaired as a direct consequence of this work.
Citations: 3
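The snapshot idea reduces tool comparison to structural diffing of normalized records: any loop where a tool's verdict disagrees with the reference (or with another tool) is a candidate bug. A toy illustration with an invented schema (AutoParBench's real snapshot format is richer than these two fields):

```python
import json

# Hypothetical "JSON snapshot": one record per loop with the tool's
# parallelization verdict. Field names are illustrative, not the
# actual AutoParBench schema.
tool_a = json.loads('[{"loop": "main.c:12", "parallel": true},'
                    ' {"loop": "main.c:30", "parallel": false}]')
reference = json.loads('[{"loop": "main.c:12", "parallel": true},'
                       ' {"loop": "main.c:30", "parallel": true}]')

def diff(snapshot: list, ref: list) -> list:
    """Loops where the tool's verdict disagrees with the reference."""
    ref_by_loop = {r["loop"]: r["parallel"] for r in ref}
    return [r["loop"] for r in snapshot
            if ref_by_loop.get(r["loop"]) != r["parallel"]]

print(diff(tool_a, reference))  # ['main.c:30']
```

Here the tool missed a loop the reference marks parallelizable; in the reverse direction, a loop a tool parallelizes against the reference would point at a potential correctness bug rather than a missed opportunity.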
A coordinate-oblivious index for high-dimensional distance similarity searches on the GPU
Pub Date: 2020-06-29 | DOI: 10.1145/3392717.3392768
Brian Donnelly, M. Gowanlock
Abstract: We present COSS, an exact method for high-dimensional distance similarity self-joins using the GPU, which finds all points within a search distance ε of each point in a dataset. The similarity self-join can take advantage of the massive parallelism afforded by GPUs, as each point can be searched in parallel. Despite high GPU throughput, distance similarity self-joins exhibit irregular memory access patterns which yield branch divergence and other performance-limiting factors. Consequently, we propose several GPU optimizations to improve self-join query throughput, including an index designed for the GPU architecture. As data dimensionality increases, the search space increases exponentially. Therefore, to find a reasonable number of neighbors for each point in the dataset, ε may need to be large. The majority of indexing strategies that are used to prune the ε-search focus on a spatial partition of data points based on each point's coordinates. As dimensionality increases, this data partitioning and pruning strategy yields exhaustive searches that eventually degrade to a brute-force (quadratic) search, which is the well-known curse of dimensionality problem. To enable pruning the search using an indexing scheme in high-dimensional spaces, we depart from previous indexing approaches and propose an indexing strategy that does not index based on each point's coordinate values. Instead, we index based on the distances to reference points, which are arbitrary points in the coordinate space. We show that our indexing scheme is able to prune the search for nearby points in high-dimensional spaces where other approaches yield high performance degradation. COSS achieves a speedup over CPU and GPU reference implementations of up to 17.7X and 11.8X, respectively.
Citations: 1
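The reference-point index rests on the triangle inequality: for any reference point r, d(p,q) ≥ |d(p,r) − d(q,r)|, so a pair whose precomputed reference distances differ by more than ε cannot be within ε of each other and is pruned without ever computing d(p,q). A small sketch of that pruning test (the reference point and data are illustrative):

```python
import math

def dist(p, q) -> float:
    """Euclidean distance between two points of equal dimensionality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def can_prune(dp_r: float, dq_r: float, eps: float) -> bool:
    """Triangle inequality: d(p,q) >= |d(p,r) - d(q,r)|. If the stored
    reference distances differ by more than eps, the pair is provably
    not a join result and the full distance computation is skipped."""
    return abs(dp_r - dq_r) > eps

ref = (0.0,) * 8                      # one reference point in 8-D
p, q = (1.0,) * 8, (3.0,) * 8
dp, dq = dist(p, ref), dist(q, ref)   # ~2.83 and ~8.49
print(can_prune(dp, dq, 1.0))         # True: pair pruned without d(p,q)
```

Because only one scalar per (point, reference) pair is stored, the test costs the same in 8 or 800 dimensions, which is what makes the index coordinate-oblivious.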
NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems
Pub Date: 2020-06-29 | DOI: 10.1145/3392717.3392771
Ching-Hsiang Chu, Pouya Kousha, A. Awan, Kawthar Shafie Khorassani, H. Subramoni, D. Panda
Abstract: Advanced fabrics like NVIDIA NVLink are enabling the deployment of dense Graphics Processing Unit (GPU) systems such as DGX-2 and Summit. With the wide adoption of large-scale GPU-enabled systems for distributed deep learning (DL) training, it is vital to design efficient communication, such as the Allreduce operation, to achieve near-ideal speedup at scale. In this paper, we propose a link-efficient scheme based on NVLink-aware cooperative reduction kernels to significantly accelerate Allreduce operations for distributed deep learning applications. By overlapping computation and communication and maximizing utilization of all available NVLinks between CPU and GPU, as well as among GPUs, we demonstrate a 1.8X performance improvement of Allreduce on 1,536 GPUs compared to state-of-the-art GPU-aware MPI and NVIDIA NCCL libraries. Finally, we demonstrate 93.9% and 89.7% scaling efficiency (i.e., 15X and 172X speedup) for training ResNet-50 models using TensorFlow on a 16-GPU DGX-2 node and on 192 GPUs of the Summit system, respectively. To the best of our knowledge, this is the first study that achieves near-ideal scaling efficiency for distributed DL training with designs tailored for cutting-edge systems like DGX-2 and Summit clusters.
Citations: 25
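For background, the classical link-efficient baseline that schemes like this improve on is the ring Allreduce: the buffer is split into one chunk per rank, and n−1 reduce-scatter steps followed by n−1 allgather steps keep every link busy simultaneously. A pure-Python simulation of that baseline (a conceptual stand-in, not the paper's NVLink-aware cooperative kernels):

```python
def ring_allreduce(buffers: list) -> list:
    """Simulate ring Allreduce over n ranks; each buffer has n chunks
    (one element per chunk here). Every rank ends with the element-wise
    sum. Each step moves exactly one chunk per ring link, so all links
    carry traffic in parallel."""
    n = len(buffers)
    chunks = [list(b) for b in buffers]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum
    # of one chunk.
    for step in range(n - 1):
        for r in range(n):
            idx = (r - 1 - step) % n
            chunks[r][idx] += chunks[(r - 1) % n][idx]
    # Allgather: circulate the completed chunks for n-1 more steps.
    for step in range(n - 1):
        for r in range(n):
            idx = (r - step) % n
            chunks[r][idx] = chunks[(r - 1) % n][idx]
    return chunks

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# every rank ends with the element-wise sum: [[12, 15, 18]] * 3
```

Each rank sends 2(n−1)/n of the buffer in total, which is bandwidth-optimal; the paper's contribution is mapping this kind of pipelined reduction onto the heterogeneous NVLink topologies of DGX-2 and Summit while overlapping the compute.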
RICH
Pub Date: 2020-06-29
Vladimir Dimic, Miquel Moretó, M. Casas, Jan Ciesko, Mateo Valero
Citations: 2
CSB-RNN: a faster-than-realtime RNN acceleration framework with compressed structured blocks
Pub Date: 2020-05-11 | DOI: 10.1145/3392717.3392749
Runbin Shi, Peiyan Dong, Tong Geng, Yuhao Ding, Xiaolong Ma, Hayden Kwok-Hay So, M. Herbordt, Ang Li, Yanzhi Wang
Abstract: Recurrent neural networks (RNNs) have been widely adopted in temporal sequence analysis, where realtime performance is often in demand. However, RNNs suffer from a heavy computational workload, as the model often comes with large weight matrices. Pruning (a model compression method) schemes have been proposed for RNNs to eliminate the redundant (close-to-zero) weight values. On one hand, non-structured pruning methods achieve a high pruning rate but introduce computation irregularity (random sparsity), which is unfriendly to parallel hardware. On the other hand, hardware-oriented structured pruning suffers from a low pruning rate due to restrictive constraints on the allowable pruning structure. This paper presents CSB-RNN, an optimized full-stack RNN framework with a novel compressed structured block (CSB) pruning technique. The CSB-pruned RNN model comes with both fine pruning granularity that facilitates a high pruning rate and a regular structure that benefits hardware parallelism. To address the challenges in parallelizing inference on the CSB-pruned model with its fine-grained structural sparsity, we propose a novel hardware architecture with a dedicated compiler. Gaining from the architecture-compilation co-design, the hardware not only supports various RNN cell types but also addresses the challenging workload-imbalance issue, and therefore significantly improves hardware efficiency (utilization). Compared to a vanilla design without optimizations, hardware utilization is enhanced by over 2X. In experiments on 10 RNN models from multiple application domains, CSB pruning demonstrates a 3.5X-25X lossless pruning rate, which is 1.6X to 3.9X over existing designs. With several other innovations applied, CSB-RNN inference achieves faster-than-realtime latency of 0.79μs-6.58μs in an FPGA implementation, which contributes to 1.12X-12.57X lower latency and a 3.53X-58.89X improvement in power efficiency over the state of the art.
Citations: 10
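The appeal of block-structured pruning is that every block ends up with the same number of surviving weights, so parallel hardware lanes stay load-balanced while the granularity stays fine enough for high pruning rates. A toy sketch in that spirit (the block shape and the L1-based selection rule here are illustrative, not CSB's exact scheme):

```python
def block_prune(weights: list, block_rows: int, keep: int) -> list:
    """Split the weight matrix into row blocks; within each block keep
    only the `keep` columns with the largest L1 magnitude and zero the
    rest. Every block then has an identical nonzero count, so parallel
    lanes processing different blocks do equal work."""
    rows, cols = len(weights), len(weights[0])
    pruned = [row[:] for row in weights]
    for start in range(0, rows, block_rows):
        block = pruned[start:start + block_rows]
        l1 = [sum(abs(r[c]) for r in block) for c in range(cols)]
        kept = set(sorted(range(cols), key=lambda c: -l1[c])[:keep])
        for r in block:
            for c in range(cols):
                if c not in kept:
                    r[c] = 0.0
    return pruned

w = [[0.9, 0.01, 0.5, 0.02],
     [0.8, 0.03, 0.4, 0.01]]
print(block_prune(w, block_rows=2, keep=2))
# columns 0 and 2 survive: [[0.9, 0.0, 0.5, 0.0], [0.8, 0.0, 0.4, 0.0]]
```

Unstructured pruning would zero individual entries independently, giving a higher rate but a nonzero pattern no fixed-width hardware lane can consume without load imbalance; this middle ground is what the CSB hardware and compiler are co-designed to exploit.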
CFDNet: a deep learning-based accelerator for fluid simulations
Pub Date: 2020-05-09 | DOI: 10.1145/3392717.3392772
Octavi Obiols-Sales, Abhinav Vishnu, Nicholas Malaya, Aparna Chandramowlishwaran
Abstract: CFD is widely used in physical system design and optimization, where it is used to predict engineering quantities of interest, such as the lift on a plane wing or the drag on a motor vehicle. However, many systems of interest are prohibitively expensive for design optimization, due to the expense of evaluating CFD simulations. To render the computation tractable, reduced-order or surrogate models are used to accelerate simulations while respecting the convergence constraints provided by the higher-fidelity solution. This paper introduces CFDNet, a coupled physical simulation and deep learning framework for accelerating the convergence of Reynolds-Averaged Navier-Stokes simulations. CFDNet is designed to predict the primary physical properties of the fluid, including velocity, pressure, and eddy viscosity, using a single convolutional neural network at its core. We evaluate CFDNet on a variety of use cases, both extrapolative and interpolative, where test geometries are observed or not observed during training. Our results show that CFDNet meets the convergence constraints of the domain-specific physics solver while outperforming it by 1.9-7.4X on both steady laminar and turbulent flows. Moreover, we demonstrate the generalization capacity of CFDNet by testing its prediction on new geometries unseen during training. In this case, the approach meets the CFD convergence criterion while still providing significant speedups over traditional domain-only models.
Citations: 84
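The coupling pattern is: one cheap network inference warm-starts the flow field, then the ordinary physics solver iterates until its usual residual tolerance, so the convergence guarantee is never traded away. A toy sketch with a scalar stand-in for the flow field (all arguments are placeholders for a trained CNN and a RANS solver, and the toy iteration is invented for illustration):

```python
def accelerated_solve(initial_field, predict, solver_step, residual, tol):
    """CFDNet-style coupling: jump close to the converged solution with
    one network inference, then hand back to the physics solver, which
    enforces the same residual tolerance it always would."""
    field = predict(initial_field)   # one cheap surrogate inference
    iters = 0
    while residual(field) > tol:     # the solver's usual criterion
        field = solver_step(field)
        iters += 1
    return field, iters

# Toy "solver": a contraction converging to 1.0; toy "network": jumps
# straight to 0.9 instead of starting from 0.0.
guess, iters = accelerated_solve(
    0.0,
    predict=lambda f: 0.9,
    solver_step=lambda f: (f + 1.0) / 2.0,
    residual=lambda f: abs(f - 1.0),
    tol=1e-3,
)
print(iters)  # 7, versus 10 iterations from a cold start
```

Because the final answer always comes from the converged physics iteration, the speedup costs no accuracy: a poor network guess only means more refinement iterations, never a wrong result.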
How I learned to stop worrying about user-visible endpoints and love MPI
Pub Date: 2020-05-01 | DOI: 10.1145/3392717.3392773
Rohit Zambre, Aparna Chandramowlishwaran, P. Balaji
Abstract: MPI+threads is gaining prominence as an alternative to the traditional "MPI everywhere" model in order to better handle the disproportionate increase in the number of cores compared with other on-node resources. However, the communication performance of MPI+threads can be 100X slower than that of MPI everywhere. Both MPI users and developers are to blame for this slowdown. MPI users traditionally have not exposed logical communication parallelism. Consequently, MPI libraries have used conservative approaches, such as a global critical section, to maintain MPI's ordering constraints for MPI+threads, thus serializing access to the underlying parallel network resources and limiting performance. To enhance the communication performance of MPI+threads, researchers have proposed MPI Endpoints as a user-visible extension to the MPI-3.1 standard. MPI Endpoints allows a single process to create multiple MPI ranks within a communicator. This could, in theory, allow each thread to have a dedicated communication path to the network, thus avoiding resource contention between threads and improving performance. The onus of mapping threads to endpoints, however, would then be on domain scientists. In this paper we play the role of devil's advocate and question the need for such user-visible endpoints. We certainly agree that dedicated communication channels are critical. To what extent, however, can we hide these channels inside the MPI library without modifying the MPI standard, and thus unburden the user? More importantly, what functionality would we lose through such abstraction? This paper answers these questions through a new implementation of the MPI-3.1 standard that uses multiple virtual communication interfaces (VCIs) inside the MPI library. VCIs abstract underlying network contexts. When users expose parallelism through existing MPI mechanisms, the MPI library maps that parallelism to the VCIs, relieving domain scientists from worrying about endpoints. We identify cases where user-exposed parallelism on VCIs performs as well as user-visible endpoints, as well as cases where such abstraction hurts performance.
Citations: 11
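The VCI idea can be caricatured in a few lines: instead of adding user-visible endpoints to the standard, the library maps parallelism hints MPI already exposes (distinct communicators, tags, ranks) onto internal network contexts, while keeping message streams that MPI must order on the same context. A deliberately simplified sketch (an illustration of the concept, not MPICH's actual mapping logic):

```python
def select_vci(comm_id: int, rank: int, tag: int, num_vcis: int) -> int:
    """Map a message's existing MPI coordinates onto one of `num_vcis`
    internal virtual communication interfaces. Messages with the same
    (communicator, rank, tag) always land on the same VCI, preserving
    MPI's pairwise non-overtaking guarantee; traffic on distinct tags
    or communicators can spread across VCIs, and therefore across
    independent network hardware, with no new user-facing API."""
    return hash((comm_id, rank, tag)) % num_vcis

# Two threads communicating on different tags may be routed to
# independent network paths without the user managing endpoints.
vci_a = select_vci(comm_id=1, rank=0, tag=100, num_vcis=8)
vci_b = select_vci(comm_id=1, rank=0, tag=101, num_vcis=8)
```

The trade-off the paper probes is exactly this abstraction boundary: when the hints suffice, hidden VCIs match user-visible endpoints; when they do not, the mapping can put logically independent streams on one context and serialize them.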