Proceedings of the 11th Workshop on General Purpose GPUs: Latest Publications

Oversubscribed Command Queues in GPUs
Pub Date: 2018-02-24 · DOI: 10.1145/3180270.3180271
Sooraj Puthoor, Xulong Tang, Joseph Gross, Bradford M. Beckmann
Abstract: As GPUs become larger and provide an increasing number of parallel execution units, a single kernel is no longer sufficient to utilize all available resources. As a result, GPU applications are beginning to use fine-grain asynchronous kernels, which are executed in parallel and expose more concurrency. Currently, the Heterogeneous System Architecture (HSA) and Compute Unified Device Architecture (CUDA) specifications support concurrent kernel launches with the help of multiple command queues (HSA queues and CUDA streams, respectively). In conjunction, GPU hardware has decreased launch overheads, making fine-grain kernels more attractive. Although increasing the number of command queues is good for kernel concurrency, the GPU hardware can only monitor a fixed number of queues at any given time. Therefore, if the number of command queues exceeds the hardware's monitoring capability, the queues become oversubscribed and the hardware has to service some of them sequentially. This mapping process periodically swaps between all allocated queues and limits the available concurrency to the ready kernels in the currently mapped queues. In this paper, we bring attention to the queue oversubscription challenge and demonstrate one solution, queue prioritization, which provides up to a 45x speedup on the NW benchmark over a baseline that swaps queues in round-robin fashion.
Citations: 16
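The oversubscription effect described in the abstract can be sketched with a toy scheduler model. Everything below is an illustrative invention: the "busiest-first" policy stands in for prioritization, whereas the paper's actual policy is dependency-aware.

```python
from collections import deque

def service_queues(queues, slots, pick):
    """Toy model of a GPU command processor that can monitor only
    `slots` queues at a time. Each cycle it maps `slots` queues,
    dequeues one kernel from each mapped non-empty queue, then
    remaps. Returns the number of cycles to drain all queues."""
    qs = [deque(q) for q in queues]
    cycles = 0
    while any(qs):
        for qi in pick(qs, slots):
            if qs[qi]:
                qs[qi].popleft()
        cycles += 1
    return cycles

def round_robin():
    """Baseline: rotate the mapping over all allocated queues."""
    start = 0
    def pick(qs, slots):
        nonlocal start
        chosen = [(start + i) % len(qs) for i in range(slots)]
        start = (start + slots) % len(qs)
        return chosen
    return pick

def busiest_first():
    """Stand-in 'prioritization': always map the queues with the
    most pending kernels. Purely illustrative."""
    def pick(qs, slots):
        return sorted(range(len(qs)), key=lambda i: -len(qs[i]))[:slots]
    return pick

# Four queues oversubscribe two hardware slots; the work is skewed.
work = [[0] * 8, [0], [0], [0]]
rr = service_queues(work, 2, round_robin())     # wastes cycles on empty queues
pri = service_queues(work, 2, busiest_first())  # keeps the long queue mapped
```

With this workload the round-robin mapper burns cycles remapping already-empty queues, which is the behavior the paper's prioritization avoids.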
Overcoming the difficulty of large-scale CGH generation on multi-GPU cluster
Pub Date: 2018-02-24 · DOI: 10.1145/3180270.3180273
T. Baba, Shinpei Watanabe, B. Jackin, Takeshi Ohkawa, K. Ootsu, T. Yokota, Y. Hayasaki, T. Yatagai
Abstract: The 3D holographic display has long been expected to become a future human interface, as it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study indicates that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment in order to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include changing the object decomposition, reducing data transfer between CPU and GPU, kernel integration, stream processing, and utilizing multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. Experimental results show that the intra-node optimizations attain an 11.52x speedup over the original single-node code. Further, the multi-node optimizations, using 8 nodes with 2 GPUs per node, attain an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object, a 237.92x speedup over sequential CPU processing with a conventional FFT-based algorithm.
Citations: 3
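The decomposition that makes a large 2D transform cluster-friendly is that it factors into independent 1D transforms over rows, then over columns, with a single transpose between the passes. The sketch below illustrates this with a naive DFT in plain Python; it is only a shape-of-the-algorithm illustration, not the paper's FFT implementation, which operates on gigapixel data.

```python
import cmath

def dft1d(x):
    """Naive O(n^2) 1D discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]

def dft2d_rowcol(a):
    """2D DFT as independent 1D DFTs over rows, then over columns.
    Each pass is embarrassingly parallel, so rows (then columns) can
    be distributed across cluster nodes; the only inter-node
    communication is the transpose between the two passes."""
    rows = [dft1d(r) for r in a]
    cols = [dft1d(list(c)) for c in zip(*rows)]   # transpose, transform
    return [list(r) for r in zip(*cols)]          # transpose back

freq = dft2d_rowcol([[1, 2], [3, 4]])
```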
Transparent Avoidance of Redundant Data Transfer on GPU-enabled Apache Spark
Pub Date: 2018-02-24 · DOI: 10.1145/3180270.3180276
Ryo Asai, M. Okita, Fumihiko Ino, K. Hagihara
Abstract: This paper presents an extension to IBMSparkGPU, an Apache Spark framework capable of executing compute- or memory-intensive tasks on a graphics processing unit (GPU). The key contribution of this extension is an automated runtime that implicitly avoids redundant CPU-GPU data transfers without code modification. To realize this transparent capability, the runtime dynamically analyzes data dependencies of the target Spark code; thus, intermediate data on the GPU can be cached, reused, and replaced appropriately to achieve acceleration. Experimental results demonstrate that the proposed runtime accelerates a machine learning application by a factor of 1.3. We expect that the proposed transparent runtime will be useful for accelerating IBMSparkGPU applications, which typically include a chain of GPU-offloaded tasks.
Citations: 7
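The core idea, skipping a host-to-device copy when an up-to-date device copy already exists, can be sketched as follows. The class and method names are hypothetical, not IBMSparkGPU's API, and a plain list stands in for a device buffer.

```python
class TransferCache:
    """Hypothetical sketch of a transfer-avoiding runtime: before
    copying a host buffer to the GPU, check whether a device copy of
    the same data version already exists and reuse it instead."""

    def __init__(self):
        self.device = {}   # key -> (version, device_buffer)
        self.copies = 0    # number of actual H2D transfers performed

    def to_device(self, key, version, host_buf):
        cached = self.device.get(key)
        if cached and cached[0] == version:
            return cached[1]            # reuse cached copy: no transfer
        dev = list(host_buf)            # stand-in for a real H2D memcpy
        self.device[key] = (version, dev)
        self.copies += 1
        return dev

cache = TransferCache()
cache.to_device("rdd_partition_0", 1, [1, 2, 3])  # first use: transfer
cache.to_device("rdd_partition_0", 1, [1, 2, 3])  # chained task: reused
```

The version tag is what a dependency analysis would provide: it tells the runtime whether the host data changed since the last upload, so a chain of GPU-offloaded tasks pays for the transfer only once.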
Generating High Performance GPU Code using Rewrite Rules with Lift
Pub Date: 2018-02-24 · DOI: 10.1145/3180270.3182628
Christophe Dubach
Abstract: Graphics processors (GPUs) are the cornerstone of modern heterogeneous systems. GPUs exhibit tremendous computational power but are notoriously hard to program. High-level programming languages and domain-specific languages have been proposed to address this issue. However, they often rely on complex analysis in the compiler or device-specific implementations to achieve maximum performance. This means that compilers and software implementations need to be re-written and re-tuned continuously as new hardware emerges. In this talk, I will present Lift, a novel high-level data-parallel programming model. The language is based on a surprisingly small set of functional primitives which can be combined to define higher-level, hardware-agnostic algorithmic patterns. A system of rewrite rules is used to derive device-specific optimised low-level implementations of the algorithmic patterns. The rules encode both algorithmic choices and low-level optimisations in a unified system and let the compiler explore the optimisation space automatically. Our results show that the generated code matches the performance of highly tuned implementations of several computational kernels from the linear algebra and stencil domains across various classes of GPUs.
Citations: 0
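A classic example of such a rewrite rule is map fusion: map(f) composed with map(g) rewrites to map(f after g), so the data is traversed once instead of twice. The sketch below applies this rule to a toy pipeline representation in Python rather than Lift; the representation is invented for illustration.

```python
def compose(f, g):
    """Function composition: (f . g)(x) = f(g(x))."""
    return lambda x: f(g(x))

def fuse_maps(pipeline):
    """Rewrite rule: adjacent ('map', g), ('map', f) stages become a
    single ('map', f . g) stage, removing an intermediate array."""
    out = []
    for op, fn in pipeline:
        if out and op == "map" and out[-1][0] == "map":
            out[-1] = ("map", compose(fn, out[-1][1]))
        else:
            out.append((op, fn))
    return out

def run(pipeline, xs):
    """Interpret a pipeline of map stages over a list."""
    for op, fn in pipeline:
        assert op == "map"
        xs = [fn(x) for x in xs]
    return xs

pipeline = [("map", lambda x: x + 1), ("map", lambda x: x * 2)]
fused = fuse_maps(pipeline)
```

Because the rule preserves semantics, the compiler can apply (or not apply) it freely while searching the optimisation space, which is exactly how Lift explores device-specific choices.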
GPU-based Acceleration of Detailed Tissue-Scale Cardiac Simulations
Pub Date: 2018-02-24 · DOI: 10.1145/3180270.3180274
Neringa Altanaite, J. Langguth
Abstract: We present a GPU-based implementation for tissue-scale 3D simulations of the human cardiac ventricle using a physiologically realistic cell model. Computational challenges in such simulations arise from two factors, the first of which is the sheer amount of computation when simulating a large number of cardiac cells in a detailed model containing 10^4 calcium release units, 10^6 stochastically changing ryanodine receptors, and 1.5 × 10^5 L-type calcium channels per cell. Additional challenges arise from the fact that the computational tasks have various levels of arithmetic intensity and control complexity, which require careful adaptation of the simulation code to the target device. By exploiting the strengths of the GPU, we obtain performance that is far superior to that of the CPU, and also significantly higher than that of other state-of-the-art manycore devices, thus paving the way for detailed whole-heart simulations in future generations of leadership-class supercomputers.
Citations: 2
A Case for Scoped Persist Barriers in GPUs
Pub Date: 2018-02-24 · DOI: 10.1145/3180270.3180275
Dibakar Gope, Arkaprava Basu, Sooraj Puthoor, Mitesh R. Meswani
Abstract: Two key trends in computing are evident: the emergence of the GPU as a first-class compute element, and the emergence of byte-addressable non-volatile memory technologies (NVRAM) as a DRAM supplement. GPUs and NVRAMs are likely to coexist in future systems. However, previous works have focused either on GPUs or on NVRAMs, in isolation. In this work, we investigate the enhancements necessary for a GPU to efficiently and correctly manipulate NVRAM-resident persistent data structures. Specifically, we find that previously proposed CPU-centric persist barriers fall short for GPUs. We thus introduce the concept of scoped persist barriers, which aligns with the hierarchical programming framework of GPUs. Scoped persist barriers enable GPU programmers to express which execution group (a.k.a. scope) a given persist barrier applies to. We demonstrate that: (1) using a narrower scope than algorithmically required can lead to inconsistency of the persistent data structure, and (2) using a wider scope than necessary leads to significant performance loss (e.g., 25% or more). Therefore, a future GPU can benefit from persist barriers with different scopes.
Citations: 5
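The abstract's two claims can be restated as a tiny model: a barrier is correct only when its scope is at least as wide as the scope at which the data is shared, and the cheapest correct choice is the narrowest such scope. The scope names below follow the GPU execution hierarchy; the cost numbers are made up purely for illustration.

```python
# Scopes from narrowest to widest, mirroring GPU execution groups.
SCOPES = ["wavefront", "workgroup", "agent", "system"]
COST = {"wavefront": 1, "workgroup": 4, "agent": 16, "system": 64}  # illustrative

def is_consistent(sharing_scope, barrier_scope):
    """Claim (1): a persist barrier keeps NVRAM-resident data
    consistent only if its scope is at least as wide as the scope
    at which the data is shared."""
    return SCOPES.index(barrier_scope) >= SCOPES.index(sharing_scope)

def narrowest_safe_barrier(sharing_scope):
    """Claim (2): a wider scope than necessary only adds cost, so
    the best choice is the narrowest scope that is still safe."""
    return min((s for s in SCOPES if is_consistent(sharing_scope, s)),
               key=lambda s: COST[s])
```

For data shared within a workgroup, a workgroup-scoped persist barrier is both safe and cheaper than a system-scoped one; a wavefront-scoped one would be cheaper still but inconsistent.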
MaxPair: Enhance OpenCL Concurrent Kernel Execution by Weighted Maximum Matching
Pub Date: 2018-02-24 · DOI: 10.1145/3180270.3180272
Yuan Wen, M. O’Boyle, Christian Fensch
Abstract: Executing multiple OpenCL kernels concurrently on the same GPU is a promising method for improving hardware utilisation and system performance. Scheduling schemes significantly impact the resulting performance through their choice of which kernels to run together on the same GPU. Existing approaches use either the execution time or the relative speedup of kernels as a guide to group and map them to the device. However, these simple methods come at the cost of suboptimal performance. In this paper, we propose a graph-based algorithm to schedule co-run kernels in pairs to optimise system performance. Target workloads are represented by a graph in which vertices stand for distinct kernels, while an edge between two vertices indicates that co-executing the corresponding kernels delivers better performance than running them one after another. Edges are weighted by the performance gain from co-execution. Our algorithm works by finding the maximum weighted matching of the graph. By maximising the accumulated weights, our algorithm improves performance significantly compared to other approaches.
Citations: 16
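For a handful of kernels, the matching can even be computed exactly by brute force. The sketch below finds the maximum-weight pairing over a small kernel set; the kernel names and gain values are hypothetical, and real schedulers would use a polynomial-time matching algorithm instead of enumeration.

```python
from itertools import combinations

def max_weight_pairing(kernels, gain):
    """Brute-force maximum-weight matching for a small kernel set.
    `gain[(a, b)]` is the measured benefit of co-running a with b
    versus running them back-to-back; only pairs with positive gain
    get an edge. Returns (best_total_gain, chosen_pairs)."""
    edges = [(a, b, gain[(a, b)])
             for a, b in combinations(kernels, 2)
             if gain.get((a, b), 0) > 0]

    def best(remaining, used):
        # Enumerate matchings: for each edge, either include it (and
        # recurse on later edges with its endpoints marked used) or
        # let the loop skip it.
        top = (0, [])
        for i, (a, b, w) in enumerate(remaining):
            if a in used or b in used:
                continue
            sub_w, sub_pairs = best(remaining[i + 1:], used | {a, b})
            if w + sub_w > top[0]:
                top = (w + sub_w, [(a, b)] + sub_pairs)
        return top

    return best(edges, set())

kernels = ["A", "B", "C", "D"]
gain = {("A", "B"): 3, ("A", "C"): 4, ("C", "D"): 3}
total, pairs = max_weight_pairing(kernels, gain)
```

Note that the greedy choice (the single heaviest edge A-C, gain 4) is worse than the matching {A-B, C-D} with total gain 6, which is exactly why a maximum matching beats simple pairwise heuristics.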
Initial Steps toward Making GPU a First-Class Computing Resource: Sharing and Resource Management
Pub Date: 2018-02-24 · DOI: 10.1145/3180270.3182629
Jun Yang
Abstract: GPUs have evolved from traditional graphics accelerators into core compute engines for a broad class of general-purpose applications. However, current commercial offerings fall short of the great potential of GPUs, largely because they cannot be managed as easily as the CPU. Their enormous hardware resources are often greatly underutilized. We developed new architecture features to enable fine-grained sharing of GPUs, termed Simultaneous Multi-kernel (SMK), in a way similar to how the CPU achieves sharing via simultaneous multithreading (SMT). With SMK, different applications can co-exist in every streaming multiprocessor of a GPU in a fully controlled way. High resource utilization can be achieved by exploiting the heterogeneity of different application behaviors. Resource apportioning among sharers is developed for fairness, throughput, and quality of service. We also envision that SMK can enable better manageability of GPUs and new features such as more efficient synchronization mechanisms within an application.
Citations: 0