{"title":"Efficient Instrumentation of GPGPU Applications Using Information Flow Analysis and Symbolic Execution","authors":"N. Farooqui, K. Schwan, S. Yalamanchili","doi":"10.1145/2588768.2576782","DOIUrl":"https://doi.org/10.1145/2588768.2576782","url":null,"abstract":"Dynamic instrumentation of GPGPU binaries makes possible real-time introspection methods for performance debugging, correctness checks, workload characterization, and runtime optimization. Such instrumentation involves inserting code at the instruction level of an application, while the application is running, thereby able to accurately profile data-dependent application behavior. Runtime overheads seen from instrumentation, however, can obviate its utility. This paper shows how a combination of information flow analysis and symbolic execution can be used to alleviate these overheads. The methods and their effectiveness are demonstrated for a variety of GPGPU codes written in OpenCL that run on AMD GPU target backends. Kernels that can be analyzed entirely via symbolic execution need not be instrumented, thus eliminating kernel runtime overheads altogether. For the remaining GPU kernels, our results show 5-38% improvements in kernel runtime overheads.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121776242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power Modeling for Heterogeneous Processors","authors":"Tahir Diop, Natalie D. Enright Jerger, J. Anderson","doi":"10.1145/2588768.2576790","DOIUrl":"https://doi.org/10.1145/2588768.2576790","url":null,"abstract":"As power becomes an ever more important design consideration, there is a need for accurate power models at all stages of the design process. While power models are available for CPUs and GPUs, only simple models are available for heterogeneous processors. We present a micro-benchmark-based modeling technique that can be used for chip multiprocessor (CMPs) and accelerated processing units (APUs). We use our approach to model power on an Intel Xeon CPU and an AMD Fusion heterogeneous processor. The resulting error rate for the Xeon's model is below 3% and is only 7% for the Fusion. We also present a method to reduce the number of benchmarks required to create these models. Instead of running micro-benchmarks for every combination of factors (e.g. different operations or memory access patterns), we cluster similar micro-benchmarks to avoid unnecessary simulations. We show that it is possible to eliminate as many as 93% of the compute micro-benchmarks, while still producing power models having less than 10% error rate.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127298387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors","authors":"Weifeng Liu, B. Vinter","doi":"10.1145/2588768.2576786","DOIUrl":"https://doi.org/10.1145/2588768.2576786","url":null,"abstract":"Heap is one of the most important fundamental data structures in computer science. Unfortunately, for a long time heaps did not obtain ideal performance gain from widely used throughput-oriented processors because of two reasons: (1) heap property decides that operations between any parent node and its child nodes must be executed sequentially, and (2) heaps, even d-heaps (d-ary heaps or d-way heaps), cannot supply enough wide data parallelism to these processors. Recent research proposed more versatile asymmetric multicore processors (AMPs) that consist of two types of cores (latency-oriented cores with high single-thread performance and throughput-oriented cores with wide vector processing capability), unified memory address space and faster synchronization mechanism among cores with different ISAs. To leverage the AMPs for the heap data structure, in this paper we propose ad-heap, an efficient heap data structure that introduces an implicit bridge structure and properly apportions workloads to the two types of cores. We implement a batch k-selection algorithm and conduct experiments on simulated AMP environments composed of real CPUs and GPUs. In our experiments on two representative platforms, the ad-heap obtains up to 1.5x and 3.6x speedup over the optimal AMP scheduling method that executes the fastest d-heaps on the standalone CPUs and GPUs in parallel.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"426 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132060592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Evaluation and Optimization Mechanisms for Inter-operable Graphics and Computation on GPUs","authors":"Yash Ukidave, Xiang Gong, D. Kaeli","doi":"10.1145/2588768.2576784","DOIUrl":"https://doi.org/10.1145/2588768.2576784","url":null,"abstract":"Graphics Processing Units (GPUs) have gained recognition as the primary form of accelerators for graphics rendering in the gaming domain. They have also been widely accepted as the computing platform of choice in many scientific and high performance computing domains. The parallelism offered by the GPUs is used for simultaneous processing of compute and graphics by applications belonging to a range of domains. The availability of programming standards such as OpenCL and OpenGL has been leveraged to achieve the compute-graphics interoperability in the same application. However, given the increasing demands in both compute and graphics for emerging scientific visualization and immersive gaming applications, degradation in efficiency can be seen due to the continual switching between compute/graphics, swapping in and out of their associated runtime environments. We need to better understand how to tune this interoperable environment in order to allow compute and graphics to run both efficiently and simultaneously. Presently we evaluate each of these domains in isolation. In this paper, we evaluate the performance and efficiency of the OpenCL-OpenGL(CL-GL) interoperability(interop) mode. We explore different methods to improve the execution performance of the CL-GL interop-based applications. We propose a slot-based rendering mechanism for CL-GL interop to increase the efficiency of the application. To evaluate CL-GL and our slot-based scheme, we study five scientific applications using OpenCL and OpenGL for compute and graphics rendering. Our study covers two AMD Radeon discrete GPUs and one shared memory AMD APU as test platforms. We demonstrate that leveraging the CL-GL interop interface results in a 2.2X performance increase, and our slot-based rendering provides 60% increase in performance by providing a 24% improvement in L2 cache hit rate on GPUs and APUs.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126103363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting GPU Hardware Saturation for Fast Compiler Optimization","authors":"A. Magni, Christophe Dubach, M. O’Boyle","doi":"10.1145/2588768.2576791","DOIUrl":"https://doi.org/10.1145/2588768.2576791","url":null,"abstract":"Graphics Processing Units (GPUs) are efficient devices capable of delivering high performance for general purpose computation. Realizing their full performance potential often requires extensive compiler tuning. This process is particularly expensive since it has to be repeated for each target program and platform. In this paper we study the utilization of GPU hardware resources across multiple input sizes and compiler options. In this context we introduce the notion of hardware saturation. Saturation is reached when an application is executed with a number of threads large enough to fully utilize the available hardware resources. We give experimental evidence of hardware saturation and describe its properties using 16 OpenCL kernels on 3 GPUs from Nvidia and AMD. We show that input sizes that saturates the GPU show performance stability across compiler transformations. Using the thread-coarsening transformation as an example, we show that compiler settings maintain their relative performance across input sizes within the saturation region. Leveraging these hardware and software properties we propose a technique to identify the input size at the lower bound of the saturation zone, we call it Minimum Saturation Point (MSP). By performing iterative compilation on the MSP input size we obtain results effectively applicable for much large input problems reducing the overhead of tuning by an order of magnitude on average.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130310101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KMA: A Dynamic Memory Manager for OpenCL","authors":"R. Spliet, Lee W. Howes, Benedict R. Gaster, A. Varbanescu","doi":"10.1145/2588768.2576781","DOIUrl":"https://doi.org/10.1145/2588768.2576781","url":null,"abstract":"OpenCL is becoming a popular choice for the parallel programming of both multi-core CPUs and GPGPUs. One of the features missing in OpenCL, yet commonly required in irregular parallel applications, is dynamic memory allocation. In this paper, we propose KMA, a first dynamic memory allocator for OpenCL. KMA's design is based on a thorough analysis of a set of 11 algorithms, which shows that dynamic memory allocation is a necessary commodity, typically used for implementing complex data structures (arrays, lists, trees) that need constant restructuring at runtime. Taking into account both the survey findings and the status-quo of OpenCL, we design KMA as a two-layer memory manager that makes smart use of the patterns we identified in our application analysis: its basic functionality provides generic malloc() and free() APIs, while the higher layer provides support for building and efficiently managing dynamic data structures. Our experiments measure the performance and usability of KMA, using both microbenchmarks and a real-life case-study. Results show that when dynamic allocation is mandatory, KMA is a competitive allocator. We conclude that embedding dynamic memory allocation in OpenCL is feasible, but it is a complex, delicate task due to the massive parallelism of the platform and the portability issues between different OpenCL implementations.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125702972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A CPU: GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method","authors":"JeeWhan Choi, Aparna Chandramowlishwaran, Kamesh Madduri, R. Vuduc","doi":"10.1145/2588768.2576787","DOIUrl":"https://doi.org/10.1145/2588768.2576787","url":null,"abstract":"This paper presents an optimized CPU--GPU hybrid implementation and a GPU performance model for the kernel-independent fast multipole method (FMM). We implement an optimized kernel-independent FMM for GPUs, and combine it with our previous CPU implementation to create a hybrid CPU+GPU FMM kernel. When compared to another highly optimized GPU implementation, our implementation achieves as much as a 1.9× speedup. We then extend our previous lower bound analyses of FMM for CPUs to include GPUs. This yields a model for predicting the execution times of the different phases of FMM. Using this information, we estimate the execution times of a set of static hybrid schedules on a given system, which allows us to automatically choose the schedule that yields the best performance. In the best case, we achieve a speedup of 1.5× compared to our GPU-only implementation, despite the large difference in computational powers of CPUs and GPUs. We comment on one consequence of having such performance models, which is to enable speculative predictions about FMM scalability on future systems.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132593802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ParallelJS: An Execution Framework for JavaScript on Heterogeneous Systems","authors":"Jin Wang, Norman Rubin, S. Yalamanchili","doi":"10.1145/2588768.2576788","DOIUrl":"https://doi.org/10.1145/2588768.2576788","url":null,"abstract":"JavaScript has been recognized as one of the most widely used script languages. Optimizations of JavaScript engines on mainstream web browsers enable efficient execution of JavaScript programs on CPUs. However, running JavaScript applications on emerging heterogeneous architectures that feature massively parallel hardware such as GPUs has not been well studied. This paper proposes a framework for flexible mapping of JavaScript onto heterogeneous systems that have both CPUs and GPUs. The framework includes a frontend compiler, a construct library and a runtime system. JavaScript programs written with high-level constructs are compiled to GPU binary code and scheduled to GPUs by the runtime. Experiments show that the proposed framework achieves up to 26.8x speedup executing JavaScript applications on parallel GPUs over a mainstream web browser that runs on CPUs.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128939615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications","authors":"Adwait Jog, Evgeny Bolotin, Zvika Guz, Mike Parker, S. Keckler, M. Kandemir, C. Das","doi":"10.1145/2588768.2576780","DOIUrl":"https://doi.org/10.1145/2588768.2576780","url":null,"abstract":"The available computing resources in modern GPUs are growing with each new generation. However, as many general purpose applications with limited thread-scalability are tuned to take advantage of GPUs, available compute resources might not be optimally utilized. To address this, modern GPUs will need to execute multiple kernels simultaneously. As current generations of GPUs (e.g., NVIDIA Kepler, AMD Radeon) already enable concurrent execution of kernels from the same application, in this paper we address the next logical step: executing multiple concurrent applications in GPUs. We show that while this paradigm has a potential to improve the overall system performance, negative interactions among concurrently executing applications in the memory system can severely hamper the performance and fairness among applications. We show that the current application agnostic GPU memory system design can (1) lead to sub-optimal GPU performance; and (2) create significant imbalance in performance slowdowns across kernels. Thus, we argue that GPU memory system should be augmented with application awareness. As one example to the applicability of this concept, we augment the memory system hardware with application awareness such that requests from different applications can be scheduled in a round robin (RR) fashion while still preserving the benefits of the current first-ready FCFS (FR-FCFS) memory scheduling policy. Evaluations with different multi-application workloads demonstrate that the proposed memory scheduling policy, first-ready round-robin FCFS (FR-RR-FCFS), improves fairness and delivers better system performance compared to the existing FR-FCFS memory scheduling scheme.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133208434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring GPU Power with the K20 Built-in Sensor","authors":"Martin Burtscher, I. Zecena, Ziliang Zong","doi":"10.1145/2588768.2576783","DOIUrl":"https://doi.org/10.1145/2588768.2576783","url":null,"abstract":"GPU-accelerated programs are becoming increasingly common in HPC, personal computers, and even handheld devices, making it important to optimize their energy efficiency. However, accurately profiling the power consumption of GPU code is not straightforward. In fact, we have identified multiple anomalies when using the on-board power sensor of K20 GPUs. For example, we have found that doubling a kernel's runtime more than doubles its energy usage, that kernels consume energy after they have stopped executing, and that running two kernels in close temporal proximity inflates the energy consumption of the later kernel. Moreover, we have observed that the power sampling frequency varies greatly and that the GPU sensor only performs power readings once in a while. We present a methodology to accurately compute the instant power and the energy consumption despite these issues.","PeriodicalId":394600,"journal":{"name":"Proceedings of Workshop on General Purpose Processing Using GPUs","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125851647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}