{"title":"Merge or Separate?: Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms","authors":"Y. Wen, M. O’Boyle","doi":"10.1145/3038228.3038235","DOIUrl":"https://doi.org/10.1145/3038228.3038235","url":null,"abstract":"Computer systems are increasingly heterogeneous with nodes consisting of CPUs and GPU accelerators. As such systems become mainstream, they move away from specialized high-performance single application platforms to a more general setting with multiple, concurrent, application jobs. Determining how jobs should be dynamically best scheduled to heterogeneous devices is non-trivial. In certain cases, performance is maximized if jobs are allocated to a single device, in others, sharing is preferable. In this paper, we present a runtime framework which schedules multi-user OpenCL tasks to their most suitable device in a CPU/GPU system. We use a machine learning-based predictive model at runtime to detect whether to merge OpenCL kernels or schedule them separately to the most appropriate devices without the need for ahead-of-time profiling. We evaluate out approach over a wide range of workloads, on two separate platforms. We consistently show significant performance and turn-around time improvement over the state-of-the-art across programs, workload, and platforms.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131150894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Directive-based tile abstraction to distribute loops on accelerators","authors":"T. Vanderbruggen, John Cavazos, C. Liao, D. Quinlan","doi":"10.1145/3038228.3038238","DOIUrl":"https://doi.org/10.1145/3038228.3038238","url":null,"abstract":"Optimizing applications for the next generation of super-computers requires next generation compilers. These compilers need to provide an abstraction for the developer to describe the inner working of applications. And, next generation compilers need to be able to intelligently apply optimizations to a wide variety of algorithms solved by scientific applications. They need to optimize applications for any workload targeting any architecture. In this paper, we present an important component of any next generation supercomputer compiler that we call TileK. TileK is a tile abstraction used to generate distributed kernels from nested loops. It provides a high-level abstraction used to decompose the iteration space of loop nests. Its directives-based language enables an effective and efficient placement of multi-dimensional computations on the 3D topology of accelerators (e.g. graphics processing units, GPUs). We implemented both the tile abstraction and the kernel generator in ROSE Compiler. We used TileK to parallelize linear algebra kernels and stencils, targeting multicore CPUs (pThread) and GPUs (OpenCL). TileK enabled us to explore and evaluate a large optimization space of many versions of these kernels for varying input sizes. Our results shows that the selection of a given optimization for a specific input size is a challenging problem.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125701155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding The Security of Discrete GPUs","authors":"Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu, E. Witchel, M. Silberstein","doi":"10.1145/3038228.3038233","DOIUrl":"https://doi.org/10.1145/3038228.3038233","url":null,"abstract":"GPUs have become an integral part of modern systems, but their implications for system security are not yet clear. This paper demonstrates both that discrete GPUs cannot be used as secure co-processors and that GPUs provide a stealthy platform for malware. First, we examine a recent proposal to use discrete GPUs as secure co-processors and show that the security guarantees of the proposed system do not hold on the GPUs we investigate. Second, we demonstrate that (under certain circumstances) it is possible to bypass IOMMU protections and create stealthy, long-lived GPU-based malware. We demonstrate a novel attack that compromises the in-kernel GPU driver and one that compromises GPU microcode to gain full access to CPU physical memory. In general, we find that the highly sophisticated, but poorly documented GPU hardware architecture, hidden behind obscure close-source device drivers and vendor-specific APIs, not only make GPUs a poor choice for applications requiring strong security, but also make GPUs into a security threat.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130615724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Convex Optimization on GPUs for Embedded Model Predictive Control","authors":"Leiming Yu, A. Goldsmith, S. D. Cairano","doi":"10.1145/3038228.3038234","DOIUrl":"https://doi.org/10.1145/3038228.3038234","url":null,"abstract":"GPU applications have traditionally run on PCs or in larger scale systems. With the introduction of the Tegra line of mobile processors, NVIDIA expanded the types of systems that can exploit the massive parallelism offered by GPU computing architectures. In this paper, we evaluate the suitability of the Tegra X1 processor as a platform for embedded model predictive control. MPC relies on the real time solution of a convex optimization problem to compute the control input(s) to a system. Relative to traditional control techniques such as PID, MPC is very computationally demanding. Quadratic programming algorithms for the solution of convex optimization problems generally lend themselves to parallelization. However, until the introduction of the Tegra, there has never been an off-the-shelf embedded processor that would enable a massively parallel embedded implementation. We investigate two different gradient based algorithms, ADMM and PQP, for solving the QP that occurs in a large class of MPC problems. The performance of these algorithms is dominated by the performance of matrix-matrix and matrix-vector products. Our work focuses on maximizing the performance of these operations for relatively small matrices of 100 to 1000 elements per dimension, which are common in the MPC control implementations found in automotive and factory automation applications. Modern BLAS libraries for CPUs and GPUs are quantitatively evaluated. We create SGEMV kernels that can outperform the state-of-the-art cuBLAS by 2.3x on TX1. Different kernel fusion schemes utilizing concurrent kernel execution and zero copy mechanisms are investigated. For ADMM, our implementation achieves 46.6x speedup over the single threaded CPU version and 2.7x speedup over the optimized OpenBLAS version. For PQP, we achieve 41.2x speedup over the single threaded CPU version and 4.2x speedup over the OpenBLAS version.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"57 18","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131874771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DNNMark: A Deep Neural Network Benchmark Suite for GPUs","authors":"Shi Dong, D. Kaeli","doi":"10.1145/3038228.3038239","DOIUrl":"https://doi.org/10.1145/3038228.3038239","url":null,"abstract":"Deep learning algorithms have been growing in popularity in the machine learning community based on their ability to accurately perform clustering and classification in a number of domains. One commonly used class of deep learning techniques is deep neural networks (DNNs). They are composed of a massive number of artificial neurons and many hidden layers. As a complex scientific computing problem, deep neural networks encompass a rich set of computing-intensive and data-intensive workloads including convolution, pooling, and inner products. All of these workloads can be used as standalone programs to benchmark hardware performance. As the GPU develops into a popular platform used to run deep learning algorithms, hardware architects should be equipped with a representative set of benchmarks that can be used to explore design tradeoffs. This suite of workloads can be constructed from a number of primitive operations commonly found in deep neural networks. In this paper, we present DNNMark, a GPU benchmark suite that consists of a collection of deep neural network primitives, covering a rich set of GPU computing patterns. This suite is designed to be a highly configurable, extensible, and flexible framework, in which benchmarks can run either individually or collectively. The goal is to provide hardware and software developers with a set of kernels that can be used to develop increasingly complex workload scenarios. We also evaluate selected benchmarks in the suite and showcase their execution behavior on a Nvidia K40 GPU.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127118345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel CCD++ on GPU for Matrix Factorization","authors":"Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan","doi":"10.1145/3038228.3038240","DOIUrl":"https://doi.org/10.1145/3038228.3038240","url":null,"abstract":"Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization for recommender systems, including Cyclic Coordinate Descent (CCD). Recently a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are the reduction of data volume transferred from/to GPU global memory and minimization of intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, using performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127730651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Launch-Time Optimization of OpenCL GPU Kernels","authors":"Andrew S. D. Lee, T. Abdelrahman","doi":"10.1145/3038228.3038236","DOIUrl":"https://doi.org/10.1145/3038228.3038236","url":null,"abstract":"OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at this launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as such, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction and loop unrolling. In this paper we describe a novel approach to address this problem. At compile-time, the kernel input arguments and variables holding constant values of the launch geometry are identified. The kernel's PTX code is analyzed and is marked with annotations that reflect the actions an optimizer would have performed had the values of the aforementioned variables been compile-time-known constants. At kernel launch time the annotations, combined with the now known values of these variables, are used to optimize the code, thereby improving kernel performance. We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X and on average by 1.17X across the benchmarks. When taking into account the entire compilation flow, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115552780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-performance Cholesky factorization for GPU-only execution","authors":"A. Haidar, A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1145/3038228.3038237","DOIUrl":"https://doi.org/10.1145/3038228.3038237","url":null,"abstract":"We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse granularity tasks (that can be hierarchically split into fine grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult to parallelize tasks on CPUs, we develop highly-efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPU and/or low CPU-to-GPU bandwidth. We show that on latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications can not be entirely overlapped with GPU computations. Weachieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-cores Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 Gflop/s. The new algorithm also outperforms significantly the GPU-only implementation currently available in the NVIDIA cuSOLVER library.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116592858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the General Purpose GPUs","authors":"","doi":"10.1145/3038228","DOIUrl":"https://doi.org/10.1145/3038228","url":null,"abstract":"","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"18 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114463971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}