Oversubscribed Command Queues in GPUs
Sooraj Puthoor, Xulong Tang, Joseph Gross, Bradford M. Beckmann
Proceedings of the 11th Workshop on General Purpose GPUs, 2018. DOI: 10.1145/3180270.3180271

Abstract: As GPUs become larger and provide an increasing number of parallel execution units, a single kernel is no longer sufficient to utilize all available resources. As a result, GPU applications are beginning to use fine-grain asynchronous kernels, which execute in parallel and expose more concurrency. Currently, the Heterogeneous System Architecture (HSA) and Compute Unified Device Architecture (CUDA) specifications support concurrent kernel launches through multiple command queues (HSA queues and CUDA streams, respectively). In conjunction, GPU hardware has reduced launch overheads, making fine-grain kernels more attractive. Although increasing the number of command queues is good for kernel concurrency, the GPU hardware can only monitor a fixed number of queues at any given time. If the number of command queues exceeds the hardware's monitoring capability, the queues become oversubscribed and the hardware must service some of them sequentially. This mapping process periodically swaps between all allocated queues and limits the available concurrency to the ready kernels in the currently mapped queues. In this paper, we bring attention to the queue oversubscription challenge and demonstrate one solution, queue prioritization, which provides up to a 45x speedup for the NW benchmark over a baseline that swaps queues in round-robin fashion.
Overcoming the difficulty of large-scale CGH generation on multi-GPU cluster
T. Baba, Shinpei Watanabe, B. Jackin, Takeshi Ohkawa, K. Ootsu, T. Yokota, Y. Hayasaki, T. Yatagai
Proceedings of the 11th Workshop on General Purpose GPUs, 2018. DOI: 10.1145/3180270.3180273

Abstract: The 3D holographic display has long been anticipated as a future human interface, as it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study indicates that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment in order to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include changing the object decomposition scheme, reducing data transfer between the CPU and GPU, kernel integration, stream processing, and the use of multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. The experimental results show that the intra-node optimizations attain an 11.52x speedup over the original single-node code. Furthermore, the multi-node optimizations, using 8 nodes with 2 GPUs per node, attain an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object, a 237.92x speedup over sequential CPU processing with a conventional FFT-based algorithm.
{"title":"Transparent Avoidance of Redundant Data Transfer on GPU-enabled Apache Spark","authors":"Ryo Asai, M. Okita, Fumihiko Ino, K. Hagihara","doi":"10.1145/3180270.3180276","DOIUrl":"https://doi.org/10.1145/3180270.3180276","url":null,"abstract":"This paper presents an extension to IBMSparkGPU, which is an Apache Spark framework capable of compute- or memory-intensive tasks on a graphics processing unit (GPU). The key contribution of this extension is an automated runtime that implicitly avoids redundant CPU-GPU data transfers without code modification. To realize this transparent capability, the runtime analyzes data dependencies of the target Spark code dynamically; thus, intermediate data on GPU can be cached, reused, and replaced appropriately to achieve acceleration. Experimental results demonstrate that the proposed runtime accelerates a machine learning application by a factor of 1.3. We expect that the proposed transparent runtime will be useful for accelerating IBMSparkGPU applications, which typically include a chain of GPU-offloaded tasks.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130718109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating High Performance GPU Code using Rewrite Rules with Lift","authors":"Christophe Dubach","doi":"10.1145/3180270.3182628","DOIUrl":"https://doi.org/10.1145/3180270.3182628","url":null,"abstract":"Graphic processors (GPUs) are the cornerstone of modern heterogeneous systems. GPUs exhibit tremendous computational power but are notoriously hard to program. High-level programming languages and domainspecific languages have been proposed to address this issue. However, they often rely on complex analysis in the compiler or device-specific implementations to achieve maximum performance. This means that compilers and software implementations need to be re-written and re-tuned continuously as new hardware emerge. In this talk, I will present Lift, a novel high-level data-parallel programming model. The language is based on a surprisingly small set of functional primitives which can be combined to define higher-level hardwareagnostic algorithmic patterns. A system of rewrite-rules is used to derive device-specific optimised low-level implementations of the algorithmic patterns. The rules encode both algorithmic choices and low-level optimisations in a unified system and let the compiler explore the optimisation space automatically. Our results show that the generated code matches the performance of highly tuned implementations of several computational kernels from linear algebra and stencil domain across various classes of GPUs.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134481049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-based Acceleration of Detailed Tissue-Scale Cardiac Simulations","authors":"Neringa Altanaite, J. Langguth","doi":"10.1145/3180270.3180274","DOIUrl":"https://doi.org/10.1145/3180270.3180274","url":null,"abstract":"We present a GPU based implementation for tissue-scale 3D simulations of the human cardiac ventricle using a physiologically realistic cell model. Computational challenges in such simulations arise from two factors, the first of which is the sheer amount of computation when simulating a large number of cardiac cells in a detailed model containing 104 calcium release units, 106 stochastically changing ryanodine receptors and 1.5 x 105 L-type calcium channels per cell. Additional challenges arise from the fact that the computational tasks have various levels of arithmetic intensity and control complexity, which require careful adaptation of the simulation code to the target device. By exploiting the strengths of the GPU, we obtain a performance that is far superior to that of the CPU, and also significantly higher than that of other state of the art manycore devices, thus paving the way for detailed whole-heart simulations in future generations of leadership class supercomputers.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124451327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Case for Scoped Persist Barriers in GPUs
Dibakar Gope, Arkaprava Basu, Sooraj Puthoor, Mitesh R. Meswani
Proceedings of the 11th Workshop on General Purpose GPUs, 2018. DOI: 10.1145/3180270.3180275

Abstract: Two key trends in computing are evident: the emergence of the GPU as a first-class compute element and the emergence of byte-addressable non-volatile memory technologies (NVRAM) as a DRAM supplement. GPUs and NVRAM are likely to coexist in future systems. However, previous works have focused either on GPUs or on NVRAM, in isolation. In this work, we investigate the enhancements necessary for a GPU to efficiently and correctly manipulate NVRAM-resident persistent data structures. Specifically, we find that previously proposed CPU-centric persist barriers fall short for GPUs. We thus introduce the concept of scoped persist barriers, which aligns with the hierarchical programming framework of GPUs. Scoped persist barriers enable GPU programmers to express which execution group (a.k.a. scope) a given persist barrier applies to. We demonstrate that: (1) using a narrower scope than algorithmically required can lead to inconsistency of persistent data structures, and (2) using a wider scope than necessary leads to significant performance loss (e.g., 25% or more). Therefore, a future GPU can benefit from persist barriers with different scopes.
{"title":"MaxPair: Enhance OpenCL Concurrent Kernel Execution by Weighted Maximum Matching","authors":"Yuan Wen, M. O’Boyle, Christian Fensch","doi":"10.1145/3180270.3180272","DOIUrl":"https://doi.org/10.1145/3180270.3180272","url":null,"abstract":"Executing multiple OpenCL kernels on the same GPU concurrently is a promising method for improving hardware utilisation and system performance. Schemes of scheduling impact the resulting performance significantly by selecting different kernels to run together on the same GPU. Existing approaches use either execution time or relative speedup of kernels as a guide to group and map them to the device. However, these simple methods work on the cost of providing suboptimal performance. In this paper, we propose a graph-based algorithm to schedule co-run kernel in pairs to optimise the system performance. Target workloads are represented by a graph, in which vertices stand for distinct kernels while edges between two vertices represent the corresponding two kernels co-execution can deliver a better performance than run them one after another. Edges are weighted to provide information of performance gain from co-execution. Our algorithm works in the way of finding out the maximum weighted matching of the graph. By maximising the accumulated weights, our algorithm improves performance significantly comparing to other approaches.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128155468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Initial Steps toward Making GPU a First-Class Computing Resource: Sharing and Resource Management","authors":"Jun Yang","doi":"10.1145/3180270.3182629","DOIUrl":"https://doi.org/10.1145/3180270.3182629","url":null,"abstract":"GPUs have evolved from traditional graphics accelerators into core compute engines for a broad class of general-purpose applications. However, current commercial offerings fall short of the great potential of GPUs largely because they cannot be managed as easily as the CPU. The enormous amount of hardware resources are often greatly underutilized. We developed new architecture features to enable fine-grained sharing of GPUs, termed Simultaneous Multi-kernel (SMK), in a similar way the CPU achieves sharing via simultaneous multithreading (SMT). With SMK, different applications can co-exist in every streaming multiprocessor of a GPU, in a fully controlled way. High resource utilization can be achieved by exploiting heterogeneity of different application behaviors. Resource apportion among sharers are developed for fairness, throughput, and quality-of-services. We also envision that SMK can enable better manageability of GPUs and new features such as more efficient synchronization mechanisms within an application.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128345037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}