{"title":"Merge or Separate?: Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms","authors":"Y. Wen, M. O’Boyle","doi":"10.1145/3038228.3038235","DOIUrl":"https://doi.org/10.1145/3038228.3038235","url":null,"abstract":"Computer systems are increasingly heterogeneous with nodes consisting of CPUs and GPU accelerators. As such systems become mainstream, they move away from specialized high-performance single application platforms to a more general setting with multiple, concurrent, application jobs. Determining how jobs should be dynamically best scheduled to heterogeneous devices is non-trivial. In certain cases, performance is maximized if jobs are allocated to a single device, in others, sharing is preferable. In this paper, we present a runtime framework which schedules multi-user OpenCL tasks to their most suitable device in a CPU/GPU system. We use a machine learning-based predictive model at runtime to detect whether to merge OpenCL kernels or schedule them separately to the most appropriate devices without the need for ahead-of-time profiling. We evaluate out approach over a wide range of workloads, on two separate platforms. We consistently show significant performance and turn-around time improvement over the state-of-the-art across programs, workload, and platforms.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131150894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Directive-based tile abstraction to distribute loops on accelerators","authors":"T. Vanderbruggen, John Cavazos, C. Liao, D. Quinlan","doi":"10.1145/3038228.3038238","DOIUrl":"https://doi.org/10.1145/3038228.3038238","url":null,"abstract":"Optimizing applications for the next generation of super-computers requires next generation compilers. These compilers need to provide an abstraction for the developer to describe the inner working of applications. And, next generation compilers need to be able to intelligently apply optimizations to a wide variety of algorithms solved by scientific applications. They need to optimize applications for any workload targeting any architecture. In this paper, we present an important component of any next generation supercomputer compiler that we call TileK. TileK is a tile abstraction used to generate distributed kernels from nested loops. It provides a high-level abstraction used to decompose the iteration space of loop nests. Its directives-based language enables an effective and efficient placement of multi-dimensional computations on the 3D topology of accelerators (e.g. graphics processing units, GPUs). We implemented both the tile abstraction and the kernel generator in ROSE Compiler. We used TileK to parallelize linear algebra kernels and stencils, targeting multicore CPUs (pThread) and GPUs (OpenCL). TileK enabled us to explore and evaluate a large optimization space of many versions of these kernels for varying input sizes. Our results shows that the selection of a given optimization for a specific input size is a challenging problem.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125701155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding The Security of Discrete GPUs","authors":"Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu, E. Witchel, M. Silberstein","doi":"10.1145/3038228.3038233","DOIUrl":"https://doi.org/10.1145/3038228.3038233","url":null,"abstract":"GPUs have become an integral part of modern systems, but their implications for system security are not yet clear. This paper demonstrates both that discrete GPUs cannot be used as secure co-processors and that GPUs provide a stealthy platform for malware. First, we examine a recent proposal to use discrete GPUs as secure co-processors and show that the security guarantees of the proposed system do not hold on the GPUs we investigate. Second, we demonstrate that (under certain circumstances) it is possible to bypass IOMMU protections and create stealthy, long-lived GPU-based malware. We demonstrate a novel attack that compromises the in-kernel GPU driver and one that compromises GPU microcode to gain full access to CPU physical memory. In general, we find that the highly sophisticated, but poorly documented GPU hardware architecture, hidden behind obscure close-source device drivers and vendor-specific APIs, not only make GPUs a poor choice for applications requiring strong security, but also make GPUs into a security threat.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130615724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Convex Optimization on GPUs for Embedded Model Predictive Control","authors":"Leiming Yu, A. Goldsmith, S. D. Cairano","doi":"10.1145/3038228.3038234","DOIUrl":"https://doi.org/10.1145/3038228.3038234","url":null,"abstract":"GPU applications have traditionally run on PCs or in larger scale systems. With the introduction of the Tegra line of mobile processors, NVIDIA expanded the types of systems that can exploit the massive parallelism offered by GPU computing architectures. In this paper, we evaluate the suitability of the Tegra X1 processor as a platform for embedded model predictive control. MPC relies on the real time solution of a convex optimization problem to compute the control input(s) to a system. Relative to traditional control techniques such as PID, MPC is very computationally demanding. Quadratic programming algorithms for the solution of convex optimization problems generally lend themselves to parallelization. However, until the introduction of the Tegra, there has never been an off-the-shelf embedded processor that would enable a massively parallel embedded implementation. We investigate two different gradient based algorithms, ADMM and PQP, for solving the QP that occurs in a large class of MPC problems. The performance of these algorithms is dominated by the performance of matrix-matrix and matrix-vector products. Our work focuses on maximizing the performance of these operations for relatively small matrices of 100 to 1000 elements per dimension, which are common in the MPC control implementations found in automotive and factory automation applications. Modern BLAS libraries for CPUs and GPUs are quantitatively evaluated. We create SGEMV kernels that can outperform the state-of-the-art cuBLAS by 2.3x on TX1. Different kernel fusion schemes utilizing concurrent kernel execution and zero copy mechanisms are investigated. For ADMM, our implementation achieves 46.6x speedup over the single threaded CPU version and 2.7x speedup over the optimized OpenBLAS version. For PQP, we achieve 41.2x speedup over the single threaded CPU version and 4.2x speedup over the OpenBLAS version.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"57 18","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131874771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DNNMark: A Deep Neural Network Benchmark Suite for GPUs","authors":"Shi Dong, D. Kaeli","doi":"10.1145/3038228.3038239","DOIUrl":"https://doi.org/10.1145/3038228.3038239","url":null,"abstract":"Deep learning algorithms have been growing in popularity in the machine learning community based on their ability to accurately perform clustering and classification in a number of domains. One commonly used class of deep learning techniques is deep neural networks (DNNs). They are composed of a massive number of artificial neurons and many hidden layers. As a complex scientific computing problem, deep neural networks encompass a rich set of computing-intensive and data-intensive workloads including convolution, pooling, and inner products. All of these workloads can be used as standalone programs to benchmark hardware performance. As the GPU develops into a popular platform used to run deep learning algorithms, hardware architects should be equipped with a representative set of benchmarks that can be used to explore design tradeoffs. This suite of workloads can be constructed from a number of primitive operations commonly found in deep neural networks. In this paper, we present DNNMark, a GPU benchmark suite that consists of a collection of deep neural network primitives, covering a rich set of GPU computing patterns. This suite is designed to be a highly configurable, extensible, and flexible framework, in which benchmarks can run either individually or collectively. The goal is to provide hardware and software developers with a set of kernels that can be used to develop increasingly complex workload scenarios. We also evaluate selected benchmarks in the suite and showcase their execution behavior on a Nvidia K40 GPU.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127118345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel CCD++ on GPU for Matrix Factorization","authors":"Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan","doi":"10.1145/3038228.3038240","DOIUrl":"https://doi.org/10.1145/3038228.3038240","url":null,"abstract":"Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization for recommender systems, including Cyclic Coordinate Descent (CCD). Recently a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are the reduction of data volume transferred from/to GPU global memory and minimization of intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, using performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127730651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Launch-Time Optimization of OpenCL GPU Kernels","authors":"Andrew S. D. Lee, T. Abdelrahman","doi":"10.1145/3038228.3038236","DOIUrl":"https://doi.org/10.1145/3038228.3038236","url":null,"abstract":"OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at this launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as such, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction and loop unrolling. In this paper we describe a novel approach to address this problem. At compile-time, the kernel input arguments and variables holding constant values of the launch geometry are identified. The kernel's PTX code is analyzed and is marked with annotations that reflect the actions an optimizer would have performed had the values of the aforementioned variables been compile-time-known constants. At kernel launch time the annotations, combined with the now known values of these variables, are used to optimize the code, thereby improving kernel performance. We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X and on average by 1.17X across the benchmarks. When taking into account the entire compilation flow, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115552780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-performance Cholesky factorization for GPU-only execution","authors":"A. Haidar, A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1145/3038228.3038237","DOIUrl":"https://doi.org/10.1145/3038228.3038237","url":null,"abstract":"We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse granularity tasks (that can be hierarchically split into fine grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult to parallelize tasks on CPUs, we develop highly-efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPU and/or low CPU-to-GPU bandwidth. We show that on latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications can not be entirely overlapped with GPU computations. Weachieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-cores Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 Gflop/s. The new algorithm also outperforms significantly the GPU-only implementation currently available in the NVIDIA cuSOLVER library.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116592858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the General Purpose GPUs","authors":"","doi":"10.1145/3038228","DOIUrl":"https://doi.org/10.1145/3038228","url":null,"abstract":"","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"18 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114463971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}