Proceedings of the General Purpose GPUs: Latest Publications

Merge or Separate?: Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms
Proceedings of the General Purpose GPUs Pub Date: 2017-02-04 DOI: 10.1145/3038228.3038235
Y. Wen, M. O’Boyle
{"title":"Merge or Separate?: Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms","authors":"Y. Wen, M. O’Boyle","doi":"10.1145/3038228.3038235","DOIUrl":"https://doi.org/10.1145/3038228.3038235","url":null,"abstract":"Computer systems are increasingly heterogeneous with nodes consisting of CPUs and GPU accelerators. As such systems become mainstream, they move away from specialized high-performance single application platforms to a more general setting with multiple, concurrent, application jobs. Determining how jobs should be dynamically best scheduled to heterogeneous devices is non-trivial. In certain cases, performance is maximized if jobs are allocated to a single device, in others, sharing is preferable. In this paper, we present a runtime framework which schedules multi-user OpenCL tasks to their most suitable device in a CPU/GPU system. We use a machine learning-based predictive model at runtime to detect whether to merge OpenCL kernels or schedule them separately to the most appropriate devices without the need for ahead-of-time profiling. We evaluate out approach over a wide range of workloads, on two separate platforms. We consistently show significant performance and turn-around time improvement over the state-of-the-art across programs, workload, and platforms.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131150894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35
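The scheduler and predictive model above are described only at a high level in the abstract. As a rough, hypothetical illustration of the decision point, the C++ sketch below uses an invented two-feature heuristic (the feature names global work size and compute intensity are assumptions, not the paper's trained model) to choose between merging two ready kernels into one GPU dispatch and sending them to separate devices:

```cpp
#include <iostream>
#include <string>

// Hypothetical per-kernel features a runtime might extract at enqueue time.
// Feature choice is illustrative only; the paper trains a model offline.
struct KernelJob {
    std::string name;
    size_t global_work_size;   // total work-items
    double compute_intensity;  // estimated flops per byte
};

enum class Decision { MergeOnGPU, SeparateCpuGpu };

// Stand-in for the paper's classifier: merge two small, compute-light
// kernels so neither underutilizes the GPU alone; otherwise give each
// job its own device.
Decision predict(const KernelJob& a, const KernelJob& b) {
    bool both_small = a.global_work_size < (1u << 16) &&
                      b.global_work_size < (1u << 16);
    bool both_light = a.compute_intensity < 4.0 && b.compute_intensity < 4.0;
    return (both_small && both_light) ? Decision::MergeOnGPU
                                      : Decision::SeparateCpuGpu;
}

int main() {
    KernelJob j1{"vecadd", 1 << 14, 0.5};
    KernelJob j2{"scale",  1 << 15, 0.25};
    Decision d = predict(j1, j2);
    std::cout << (d == Decision::MergeOnGPU
                      ? "merge into one GPU dispatch\n"
                      : "schedule separately on CPU and GPU\n");
    // A real runtime would now either fuse the two NDRange launches or
    // enqueue each kernel on its own OpenCL command queue.
    return 0;
}
```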
Directive-based tile abstraction to distribute loops on accelerators
Proceedings of the General Purpose GPUs Pub Date: 2017-02-04 DOI: 10.1145/3038228.3038238
T. Vanderbruggen, John Cavazos, C. Liao, D. Quinlan
{"title":"Directive-based tile abstraction to distribute loops on accelerators","authors":"T. Vanderbruggen, John Cavazos, C. Liao, D. Quinlan","doi":"10.1145/3038228.3038238","DOIUrl":"https://doi.org/10.1145/3038228.3038238","url":null,"abstract":"Optimizing applications for the next generation of super-computers requires next generation compilers. These compilers need to provide an abstraction for the developer to describe the inner working of applications. And, next generation compilers need to be able to intelligently apply optimizations to a wide variety of algorithms solved by scientific applications. They need to optimize applications for any workload targeting any architecture. In this paper, we present an important component of any next generation supercomputer compiler that we call TileK. TileK is a tile abstraction used to generate distributed kernels from nested loops. It provides a high-level abstraction used to decompose the iteration space of loop nests. Its directives-based language enables an effective and efficient placement of multi-dimensional computations on the 3D topology of accelerators (e.g. graphics processing units, GPUs). We implemented both the tile abstraction and the kernel generator in ROSE Compiler. We used TileK to parallelize linear algebra kernels and stencils, targeting multicore CPUs (pThread) and GPUs (OpenCL). TileK enabled us to explore and evaluate a large optimization space of many versions of these kernels for varying input sizes. Our results shows that the selection of a given optimization for a specific input size is a challenging problem.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125701155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
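TileK's actual directive syntax is not reproduced here; instead, the plain-C++ sketch below illustrates the loop transformation such a tile abstraction automates: decomposing a 2D iteration space into tiles, each of which a kernel generator could map onto one workgroup of an accelerator's launch topology. The sizes are arbitrary illustrative values:

```cpp
#include <vector>
#include <cstdio>

// Plain-C++ illustration of the transformation a tile abstraction like
// TileK automates: the 2D iteration space of a loop nest is decomposed
// into TILE x TILE blocks, and each block becomes a unit of work that a
// kernel generator could assign to one workgroup on an accelerator.
constexpr int N = 8, TILE = 4;

int main() {
    std::vector<float> a(N * N, 1.0f), b(N * N, 2.0f), c(N * N, 0.0f);

    for (int ti = 0; ti < N; ti += TILE)          // tile row  -> e.g. blockIdx.y
        for (int tj = 0; tj < N; tj += TILE)      // tile col  -> e.g. blockIdx.x
            for (int i = ti; i < ti + TILE; ++i)      // intra-tile -> threadIdx.y
                for (int j = tj; j < tj + TILE; ++j)  // intra-tile -> threadIdx.x
                    c[i * N + j] = a[i * N + j] + b[i * N + j];

    std::printf("c[0] = %f\n", c[0]);  // expect 3.0
    return 0;
}
```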
Understanding The Security of Discrete GPUs
Proceedings of the General Purpose GPUs Pub Date: 2017-02-04 DOI: 10.1145/3038228.3038233
Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu, E. Witchel, M. Silberstein
{"title":"Understanding The Security of Discrete GPUs","authors":"Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu, E. Witchel, M. Silberstein","doi":"10.1145/3038228.3038233","DOIUrl":"https://doi.org/10.1145/3038228.3038233","url":null,"abstract":"GPUs have become an integral part of modern systems, but their implications for system security are not yet clear. This paper demonstrates both that discrete GPUs cannot be used as secure co-processors and that GPUs provide a stealthy platform for malware. First, we examine a recent proposal to use discrete GPUs as secure co-processors and show that the security guarantees of the proposed system do not hold on the GPUs we investigate. Second, we demonstrate that (under certain circumstances) it is possible to bypass IOMMU protections and create stealthy, long-lived GPU-based malware. We demonstrate a novel attack that compromises the in-kernel GPU driver and one that compromises GPU microcode to gain full access to CPU physical memory. In general, we find that the highly sophisticated, but poorly documented GPU hardware architecture, hidden behind obscure close-source device drivers and vendor-specific APIs, not only make GPUs a poor choice for applications requiring strong security, but also make GPUs into a security threat.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130615724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
Efficient Convex Optimization on GPUs for Embedded Model Predictive Control
Proceedings of the General Purpose GPUs Pub Date: 2017-02-04 DOI: 10.1145/3038228.3038234
Leiming Yu, A. Goldsmith, S. D. Cairano
{"title":"Efficient Convex Optimization on GPUs for Embedded Model Predictive Control","authors":"Leiming Yu, A. Goldsmith, S. D. Cairano","doi":"10.1145/3038228.3038234","DOIUrl":"https://doi.org/10.1145/3038228.3038234","url":null,"abstract":"GPU applications have traditionally run on PCs or in larger scale systems. With the introduction of the Tegra line of mobile processors, NVIDIA expanded the types of systems that can exploit the massive parallelism offered by GPU computing architectures. In this paper, we evaluate the suitability of the Tegra X1 processor as a platform for embedded model predictive control. MPC relies on the real time solution of a convex optimization problem to compute the control input(s) to a system. Relative to traditional control techniques such as PID, MPC is very computationally demanding. Quadratic programming algorithms for the solution of convex optimization problems generally lend themselves to parallelization. However, until the introduction of the Tegra, there has never been an off-the-shelf embedded processor that would enable a massively parallel embedded implementation. We investigate two different gradient based algorithms, ADMM and PQP, for solving the QP that occurs in a large class of MPC problems. The performance of these algorithms is dominated by the performance of matrix-matrix and matrix-vector products. Our work focuses on maximizing the performance of these operations for relatively small matrices of 100 to 1000 elements per dimension, which are common in the MPC control implementations found in automotive and factory automation applications. Modern BLAS libraries for CPUs and GPUs are quantitatively evaluated. We create SGEMV kernels that can outperform the state-of-the-art cuBLAS by 2.3x on TX1. Different kernel fusion schemes utilizing concurrent kernel execution and zero copy mechanisms are investigated. For ADMM, our implementation achieves 46.6x speedup over the single threaded CPU version and 2.7x speedup over the optimized OpenBLAS version. For PQP, we achieve 41.2x speedup over the single threaded CPU version and 4.2x speedup over the OpenBLAS version.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"57 18","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131874771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
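As a point of reference for the operation this paper optimizes, the sketch below runs a small single-precision matrix-vector product through stock cuBLAS SGEMV, the baseline the authors' hand-written kernels beat by up to 2.3x on the TX1. The matrix size and contents are arbitrary illustrative values, not the paper's benchmark configuration:

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Baseline form of the dominant MPC operation: y = A*x for a small,
// MPC-scale matrix, via stock cuBLAS SGEMV.
int main() {
    const int n = 256;                      // "small" MPC-scale dimension
    std::vector<float> hA(n * n, 0.01f), hx(n, 1.0f), hy(n, 0.0f);

    float *dA, *dx, *dy;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major GEMV: y = alpha*A*x + beta*y
    cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n, dx, 1, &beta, dy, 1);

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("y[0] = %f\n", hy[0]);      // expect 256 * 0.01 = 2.56

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
    return 0;
}
```

Build with `nvcc sgemv_demo.cu -lcublas`; the paper's contribution is replacing exactly this library call with custom kernels tuned for the 100-1000 element size regime.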
DNNMark: A Deep Neural Network Benchmark Suite for GPUs
Proceedings of the General Purpose GPUs Pub Date: 2017-02-04 DOI: 10.1145/3038228.3038239
Shi Dong, D. Kaeli
{"title":"DNNMark: A Deep Neural Network Benchmark Suite for GPUs","authors":"Shi Dong, D. Kaeli","doi":"10.1145/3038228.3038239","DOIUrl":"https://doi.org/10.1145/3038228.3038239","url":null,"abstract":"Deep learning algorithms have been growing in popularity in the machine learning community based on their ability to accurately perform clustering and classification in a number of domains. One commonly used class of deep learning techniques is deep neural networks (DNNs). They are composed of a massive number of artificial neurons and many hidden layers. As a complex scientific computing problem, deep neural networks encompass a rich set of computing-intensive and data-intensive workloads including convolution, pooling, and inner products. All of these workloads can be used as standalone programs to benchmark hardware performance. As the GPU develops into a popular platform used to run deep learning algorithms, hardware architects should be equipped with a representative set of benchmarks that can be used to explore design tradeoffs. This suite of workloads can be constructed from a number of primitive operations commonly found in deep neural networks. In this paper, we present DNNMark, a GPU benchmark suite that consists of a collection of deep neural network primitives, covering a rich set of GPU computing patterns. This suite is designed to be a highly configurable, extensible, and flexible framework, in which benchmarks can run either individually or collectively. The goal is to provide hardware and software developers with a set of kernels that can be used to develop increasingly complex workload scenarios. We also evaluate selected benchmarks in the suite and showcase their execution behavior on a Nvidia K40 GPU.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127118345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 57
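DNNMark's own implementations are not reproduced here; as a sketch of what one of the standalone DNN primitives it benchmarks looks like as an isolated workload, below is a minimal 2x2 max-pooling kernel in CUDA with a self-contained host driver:

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// A standalone DNN primitive of the kind DNNMark packages as a benchmark:
// 2x2 max pooling over a single-channel feature map.
__global__ void maxpool2x2(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int ow = w / 2, oh = h / 2;
    if (x >= ow || y >= oh) return;
    int ix = 2 * x, iy = 2 * y;
    float m = in[iy * w + ix];
    m = fmaxf(m, in[iy * w + ix + 1]);
    m = fmaxf(m, in[(iy + 1) * w + ix]);
    m = fmaxf(m, in[(iy + 1) * w + ix + 1]);
    out[y * ow + x] = m;
}

int main() {
    const int w = 64, h = 64;
    std::vector<float> hin(w * h);
    for (int i = 0; i < w * h; ++i) hin[i] = static_cast<float>(i % 7);

    float *din, *dout;
    cudaMalloc(&din, w * h * sizeof(float));
    cudaMalloc(&dout, (w / 2) * (h / 2) * sizeof(float));
    cudaMemcpy(din, hin.data(), w * h * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((w / 2 + 15) / 16, (h / 2 + 15) / 16);
    maxpool2x2<<<grid, block>>>(din, dout, w, h);
    cudaDeviceSynchronize();

    std::vector<float> hout((w / 2) * (h / 2));
    cudaMemcpy(hout.data(), dout, hout.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    std::printf("out[0] = %f\n", hout[0]);  // max of {0, 1, 1, 2} = 2
    cudaFree(din); cudaFree(dout);
    return 0;
}
```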
Parallel CCD++ on GPU for Matrix Factorization
Proceedings of the General Purpose GPUs Pub Date: 2017-02-04 DOI: 10.1145/3038228.3038240
Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan
{"title":"Parallel CCD++ on GPU for Matrix Factorization","authors":"Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan","doi":"10.1145/3038228.3038240","DOIUrl":"https://doi.org/10.1145/3038228.3038240","url":null,"abstract":"Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization for recommender systems, including Cyclic Coordinate Descent (CCD). Recently a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are the reduction of data volume transferred from/to GPU global memory and minimization of intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, using performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127730651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
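The GPU kernels and the fusion/tiling schemes are the paper's contribution and are not reproduced here. The CPU sketch below shows only the coordinate-descent core that CCD++ builds on: closed-form, alternating rank-one updates with L2 regularization, run on a fully observed toy matrix for simplicity (real CCD++ sweeps only the observed entries of the incomplete matrix):

```cpp
#include <cstdio>
#include <vector>

// Minimal CPU sketch of the coordinate-descent core underlying CCD++:
// a rank-one fit A ~ u * v^T with regularization lambda, alternating
// closed-form updates of each u[i] and v[j].
int main() {
    const int m = 4, n = 3;
    const double lambda = 0.1;
    // Fully observed rank-1 toy matrix: outer product of (1..4) and (1..3).
    double A[m][n] = {{1,2,3},{2,4,6},{3,6,9},{4,8,12}};

    std::vector<double> u(m, 1.0), v(n, 1.0);
    for (int sweep = 0; sweep < 20; ++sweep) {
        for (int i = 0; i < m; ++i) {              // closed-form update of u[i]
            double num = 0, den = lambda;
            for (int j = 0; j < n; ++j) { num += A[i][j] * v[j]; den += v[j] * v[j]; }
            u[i] = num / den;
        }
        for (int j = 0; j < n; ++j) {              // closed-form update of v[j]
            double num = 0, den = lambda;
            for (int i = 0; i < m; ++i) { num += A[i][j] * u[i]; den += u[i] * u[i]; }
            v[j] = num / den;
        }
    }
    // Regularization shrinks the fit slightly below the exact value.
    std::printf("u[0]*v[0] = %.3f (target ~%.1f)\n", u[0] * v[0], A[0][0]);
    return 0;
}
```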
Launch-Time Optimization of OpenCL GPU Kernels
Proceedings of the General Purpose GPUs Pub Date: 2017-02-04 DOI: 10.1145/3038228.3038236
Andrew S. D. Lee, T. Abdelrahman
{"title":"Launch-Time Optimization of OpenCL GPU Kernels","authors":"Andrew S. D. Lee, T. Abdelrahman","doi":"10.1145/3038228.3038236","DOIUrl":"https://doi.org/10.1145/3038228.3038236","url":null,"abstract":"OpenCL compiles a GPU kernel first and then launches it for execution, providing the kernel at this launch with its arguments and its launch geometry. Although some of the kernel inputs and the launch geometry remain constant across all threads during execution, the compiler is unable to treat them as such, which limits its ability to apply several optimizations, including constant propagation, constant folding, strength reduction and loop unrolling. In this paper we describe a novel approach to address this problem. At compile-time, the kernel input arguments and variables holding constant values of the launch geometry are identified. The kernel's PTX code is analyzed and is marked with annotations that reflect the actions an optimizer would have performed had the values of the aforementioned variables been compile-time-known constants. At kernel launch time the annotations, combined with the now known values of these variables, are used to optimize the code, thereby improving kernel performance. We compare the execution time of 12 GPU kernels compiled with a standard LLVM-based compilation flow to their execution time when compiled with the same flow, modified to implement our approach. The results show that annotation processing is fast and that kernel performance is improved by a factor of up to 2.13X and on average by 1.17X across the benchmarks. When taking into account the entire compilation flow, the resulting benefit depends on how often a kernel is launched. When the kernel is launched many times with the same arguments and the same geometry, kernel execution time, including the compilation flow, benefits by similar factors. However, when the kernel is launched with different arguments and/or geometries, performance suffers because of the overhead of repeated PTX-to-Cubin compilation.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115552780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
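The paper's PTX-annotation mechanism is its own and is not reproduced here. A related, widely used idiom that conveys why launch-time-known constants matter is shown below: baking the launch geometry into a CUDA kernel as a template constant lets the compiler fully unroll the reduction loop, the same class of optimization the annotations unlock, but obtained here by selecting a pre-compiled specialization at launch rather than by rewriting PTX:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Block-sum reduction with the block size as a compile-time constant.
// Because BLOCK is known, the loop trip count is known and the compiler
// can fully unroll it and fold the strength-reduced indices.
template <int BLOCK>
__global__ void sum_block(const float* in, float* out) {
    __shared__ float buf[BLOCK];
    buf[threadIdx.x] = in[blockIdx.x * BLOCK + threadIdx.x];
    __syncthreads();
    #pragma unroll                       // trip count known: BLOCK is constant
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}

int main() {
    const int block = 256, grid = 4, n = block * grid;
    float *din, *dout;
    cudaMalloc(&din, n * sizeof(float));
    cudaMalloc(&dout, grid * sizeof(float));

    float hin[n], hout[grid];
    for (int i = 0; i < n; ++i) hin[i] = 1.0f;
    cudaMemcpy(din, hin, n * sizeof(float), cudaMemcpyHostToDevice);

    // "Launch-time" selection of the specialization matching the geometry.
    sum_block<256><<<grid, block>>>(din, dout);
    cudaMemcpy(hout, dout, grid * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("partial sum = %f (expect 256)\n", hout[0]);

    cudaFree(din); cudaFree(dout);
    return 0;
}
```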
High-performance Cholesky factorization for GPU-only execution
Proceedings of the General Purpose GPUs Pub Date: 2017-02-04 DOI: 10.1145/3038228.3038237
A. Haidar, A. Abdelfattah, S. Tomov, J. Dongarra
{"title":"High-performance Cholesky factorization for GPU-only execution","authors":"A. Haidar, A. Abdelfattah, S. Tomov, J. Dongarra","doi":"10.1145/3038228.3038237","DOIUrl":"https://doi.org/10.1145/3038228.3038237","url":null,"abstract":"We present our performance analysis, algorithm designs, and the optimizations needed for the development of high-performance GPU-only algorithms, and in particular, for the dense Cholesky factorization. In contrast to currently promoted designs that solve parallelism challenges on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are tasks of fine granularity and edges are the dependencies between the tasks, our designs explicitly target manycore architectures like GPUs and feature coarse granularity tasks (that can be hierarchically split into fine grain data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult to parallelize tasks on CPUs, we develop highly-efficient code for entirely GPU execution. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPU and/or low CPU-to-GPU bandwidth. We show that on latest GPUs, like the P100, this becomes so important that the GPU-only code even outperforms the hybrid MAGMA algorithms when the CPU tasks and communications can not be entirely overlapped with GPU computations. Weachieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-cores Intel Xeon E5-2650 v3 Haswell CPUs, where MKL runs up to about 500-600 Gflop/s. The new algorithm also outperforms significantly the GPU-only implementation currently available in the NVIDIA cuSOLVER library.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116592858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
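The paper's tuned GPU kernels are not included in the abstract; the plain-C++ sketch below shows only the blocked, right-looking factorization structure that such implementations map onto the GPU: an unblocked Cholesky of each diagonal block (POTF2), a triangular solve of the panel below it (TRSM), and a rank-NB update of the trailing submatrix (SYRK). The block size and SPD test matrix are illustrative:

```cpp
#include <cstdio>
#include <cmath>
#include <algorithm>

// CPU sketch of blocked, right-looking Cholesky (lower triangular).
// On a GPU, step 1 is a small POTF2 kernel, step 2 a TRSM, step 3 a
// SYRK/GEMM; the paper's contribution is running all three GPU-only.
constexpr int N = 8, NB = 4;
double A[N][N];

int main() {
    // Symmetric positive-definite test matrix: (N+1) on the diagonal, 1 off it.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) A[i][j] = (i == j) ? N + 1.0 : 1.0;

    for (int k = 0; k < N; k += NB) {
        const int kb = std::min(NB, N - k);
        // 1) POTF2: unblocked Cholesky of the diagonal block.
        for (int j = k; j < k + kb; ++j) {
            for (int p = k; p < j; ++p) A[j][j] -= A[j][p] * A[j][p];
            A[j][j] = std::sqrt(A[j][j]);
            for (int i = j + 1; i < k + kb; ++i) {
                for (int p = k; p < j; ++p) A[i][j] -= A[i][p] * A[j][p];
                A[i][j] /= A[j][j];
            }
        }
        // 2) TRSM: solve the panel below the diagonal block.
        for (int i = k + kb; i < N; ++i)
            for (int j = k; j < k + kb; ++j) {
                for (int p = k; p < j; ++p) A[i][j] -= A[i][p] * A[j][p];
                A[i][j] /= A[j][j];
            }
        // 3) SYRK: rank-kb update of the trailing submatrix (lower part).
        for (int i = k + kb; i < N; ++i)
            for (int j = k + kb; j <= i; ++j)
                for (int p = k; p < k + kb; ++p) A[i][j] -= A[i][p] * A[j][p];
    }
    std::printf("L[0][0] = %f (expect sqrt(%d) = %f)\n",
                A[0][0], N + 1, std::sqrt(N + 1.0));
    return 0;
}
```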
Proceedings of the General Purpose GPUs
Proceedings of the General Purpose GPUs DOI: 10.1145/3038228
{"title":"Proceedings of the General Purpose GPUs","authors":"","doi":"10.1145/3038228","DOIUrl":"https://doi.org/10.1145/3038228","url":null,"abstract":"","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"18 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114463971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0