Oversubscribed Command Queues in GPUs
Sooraj Puthoor, Xulong Tang, Joseph Gross, Bradford M. Beckmann
Proceedings of the 11th Workshop on General Purpose GPUs, 2018. DOI: 10.1145/3180270.3180271

Abstract: As GPUs become larger and provide an increasing number of parallel execution units, a single kernel is no longer sufficient to utilize all available resources. As a result, GPU applications are beginning to use fine-grain asynchronous kernels, which execute in parallel and expose more concurrency. Currently, the Heterogeneous System Architecture (HSA) and Compute Unified Device Architecture (CUDA) specifications support concurrent kernel launches through multiple command queues (HSA queues and CUDA streams, respectively). In conjunction, GPU hardware has reduced launch overheads, making fine-grain kernels more attractive. Although increasing the number of command queues is good for kernel concurrency, the GPU hardware can only monitor a fixed number of queues at any given time. If the number of command queues exceeds the hardware's monitoring capability, the queues become oversubscribed and the hardware must service some of them sequentially. This mapping process periodically swaps between all allocated queues and limits the available concurrency to the ready kernels in the currently mapped queues. In this paper, we bring attention to the queue oversubscription challenge and demonstrate one solution, queue prioritization, which provides up to a 45x speedup for the NW benchmark over a baseline that swaps queues in round-robin fashion.
Overcoming the difficulty of large-scale CGH generation on multi-GPU cluster
T. Baba, Shinpei Watanabe, B. Jackin, Takeshi Ohkawa, K. Ootsu, T. Yokota, Y. Hayasaki, T. Yatagai
Proceedings of the 11th Workshop on General Purpose GPUs, 2018. DOI: 10.1145/3180270.3180273

Abstract: The 3D holographic display has long been anticipated as a future human interface, as it does not require users to wear special devices. However, its heavy computational requirements have prevented the realization of such displays. A recent study indicates that objects and holograms of several gigapixels must be processed in real time to achieve high resolution and a wide viewing angle. To address this problem, we first adapted a conventional FFT algorithm to a GPU cluster environment in order to avoid heavy inter-node communication. We then applied several single-node and multi-node optimization and parallelization techniques. The single-node optimizations include changing the object decomposition scheme, reducing data transfer between the CPU and GPU, kernel integration, stream processing, and the use of multiple GPUs within a node. The multi-node optimizations include methods for distributing object data from the host node to the other nodes. The experimental results show that the intra-node optimizations attain an 11.52x speedup over the original single-node code. Furthermore, the multi-node optimizations, using 8 nodes with 2 GPUs per node, attain an execution time of 4.28 s for generating a 1.6-gigapixel hologram from a 3.2-gigapixel object, a 237.92x speedup over sequential CPU processing with a conventional FFT-based algorithm.
{"title":"Transparent Avoidance of Redundant Data Transfer on GPU-enabled Apache Spark","authors":"Ryo Asai, M. Okita, Fumihiko Ino, K. Hagihara","doi":"10.1145/3180270.3180276","DOIUrl":"https://doi.org/10.1145/3180270.3180276","url":null,"abstract":"This paper presents an extension to IBMSparkGPU, which is an Apache Spark framework capable of compute- or memory-intensive tasks on a graphics processing unit (GPU). The key contribution of this extension is an automated runtime that implicitly avoids redundant CPU-GPU data transfers without code modification. To realize this transparent capability, the runtime analyzes data dependencies of the target Spark code dynamically; thus, intermediate data on GPU can be cached, reused, and replaced appropriately to achieve acceleration. Experimental results demonstrate that the proposed runtime accelerates a machine learning application by a factor of 1.3. We expect that the proposed transparent runtime will be useful for accelerating IBMSparkGPU applications, which typically include a chain of GPU-offloaded tasks.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130718109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating High Performance GPU Code using Rewrite Rules with Lift","authors":"Christophe Dubach","doi":"10.1145/3180270.3182628","DOIUrl":"https://doi.org/10.1145/3180270.3182628","url":null,"abstract":"Graphic processors (GPUs) are the cornerstone of modern heterogeneous systems. GPUs exhibit tremendous computational power but are notoriously hard to program. High-level programming languages and domainspecific languages have been proposed to address this issue. However, they often rely on complex analysis in the compiler or device-specific implementations to achieve maximum performance. This means that compilers and software implementations need to be re-written and re-tuned continuously as new hardware emerge. In this talk, I will present Lift, a novel high-level data-parallel programming model. The language is based on a surprisingly small set of functional primitives which can be combined to define higher-level hardwareagnostic algorithmic patterns. A system of rewrite-rules is used to derive device-specific optimised low-level implementations of the algorithmic patterns. The rules encode both algorithmic choices and low-level optimisations in a unified system and let the compiler explore the optimisation space automatically. Our results show that the generated code matches the performance of highly tuned implementations of several computational kernels from linear algebra and stencil domain across various classes of GPUs.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134481049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-based Acceleration of Detailed Tissue-Scale Cardiac Simulations","authors":"Neringa Altanaite, J. Langguth","doi":"10.1145/3180270.3180274","DOIUrl":"https://doi.org/10.1145/3180270.3180274","url":null,"abstract":"We present a GPU based implementation for tissue-scale 3D simulations of the human cardiac ventricle using a physiologically realistic cell model. Computational challenges in such simulations arise from two factors, the first of which is the sheer amount of computation when simulating a large number of cardiac cells in a detailed model containing 104 calcium release units, 106 stochastically changing ryanodine receptors and 1.5 x 105 L-type calcium channels per cell. Additional challenges arise from the fact that the computational tasks have various levels of arithmetic intensity and control complexity, which require careful adaptation of the simulation code to the target device. By exploiting the strengths of the GPU, we obtain a performance that is far superior to that of the CPU, and also significantly higher than that of other state of the art manycore devices, thus paving the way for detailed whole-heart simulations in future generations of leadership class supercomputers.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124451327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Case for Scoped Persist Barriers in GPUs
Dibakar Gope, Arkaprava Basu, Sooraj Puthoor, Mitesh R. Meswani
Proceedings of the 11th Workshop on General Purpose GPUs, 2018. DOI: 10.1145/3180270.3180275

Abstract: Two key trends in computing are evident: the emergence of the GPU as a first-class compute element and the emergence of byte-addressable non-volatile memory technologies (NVRAM) as a DRAM supplement. GPUs and NVRAM are likely to coexist in future systems. However, previous works have focused either on GPUs or on NVRAM, in isolation. In this work, we investigate the enhancements necessary for a GPU to efficiently and correctly manipulate NVRAM-resident persistent data structures. Specifically, we find that previously proposed CPU-centric persist barriers fall short for GPUs. We thus introduce the concept of scoped persist barriers, which aligns with the hierarchical programming framework of GPUs. Scoped persist barriers enable GPU programmers to express which execution group (a.k.a. scope) a given persist barrier applies to. We demonstrate that: (1) using a narrower scope than algorithmically required can lead to inconsistency of persistent data structures, and (2) using a wider scope than necessary leads to significant performance loss (e.g., 25% or more). Therefore, a future GPU can benefit from persist barriers with different scopes.
{"title":"MaxPair: Enhance OpenCL Concurrent Kernel Execution by Weighted Maximum Matching","authors":"Yuan Wen, M. O’Boyle, Christian Fensch","doi":"10.1145/3180270.3180272","DOIUrl":"https://doi.org/10.1145/3180270.3180272","url":null,"abstract":"Executing multiple OpenCL kernels on the same GPU concurrently is a promising method for improving hardware utilisation and system performance. Schemes of scheduling impact the resulting performance significantly by selecting different kernels to run together on the same GPU. Existing approaches use either execution time or relative speedup of kernels as a guide to group and map them to the device. However, these simple methods work on the cost of providing suboptimal performance. In this paper, we propose a graph-based algorithm to schedule co-run kernel in pairs to optimise the system performance. Target workloads are represented by a graph, in which vertices stand for distinct kernels while edges between two vertices represent the corresponding two kernels co-execution can deliver a better performance than run them one after another. Edges are weighted to provide information of performance gain from co-execution. Our algorithm works in the way of finding out the maximum weighted matching of the graph. By maximising the accumulated weights, our algorithm improves performance significantly comparing to other approaches.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128155468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Initial Steps toward Making GPU a First-Class Computing Resource: Sharing and Resource Management","authors":"Jun Yang","doi":"10.1145/3180270.3182629","DOIUrl":"https://doi.org/10.1145/3180270.3182629","url":null,"abstract":"GPUs have evolved from traditional graphics accelerators into core compute engines for a broad class of general-purpose applications. However, current commercial offerings fall short of the great potential of GPUs largely because they cannot be managed as easily as the CPU. The enormous amount of hardware resources are often greatly underutilized. We developed new architecture features to enable fine-grained sharing of GPUs, termed Simultaneous Multi-kernel (SMK), in a similar way the CPU achieves sharing via simultaneous multithreading (SMT). With SMK, different applications can co-exist in every streaming multiprocessor of a GPU, in a fully controlled way. High resource utilization can be achieved by exploiting heterogeneity of different application behaviors. Resource apportion among sharers are developed for fairness, throughput, and quality-of-services. We also envision that SMK can enable better manageability of GPUs and new features such as more efficient synchronization mechanisms within an application.","PeriodicalId":274320,"journal":{"name":"Proceedings of the 11th Workshop on General Purpose GPUs","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128345037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}