Toward exascale computational science with heterogeneous processing
J. Vetter
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735690
Abstract: Computational requirements for scientific simulation continue to grow in scale and complexity. Meanwhile, HPC systems and centers face urgent power and thermal constraints while continuing to advance computational science. Our experience shows that heterogeneous systems can offer one possible solution to these constraints for specific applications; however, they can also introduce new challenges to programmer productivity. In this talk, I will review these benefits and challenges as they relate to DOE and NSF applications on graphics processors, and introduce an NSF-funded project to deploy an innovative supercomputer based on graphics processors.

The Scalable Heterogeneous Computing (SHOC) benchmark suite
Anthony Danalis, G. Marin, Collin McCurdy, J. Meredith, P. Roth, Kyle Spafford, V. Tipparaju, J. Vetter
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735702
Abstract: Scalable heterogeneous computing systems, which are composed of a mix of compute devices such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard. SHOC is a spectrum of programs that test the performance and stability of these scalable heterogeneous computing systems. At the lowest level, SHOC uses microbenchmarks to assess architectural features of the system. At higher levels, SHOC uses application kernels to determine system-wide performance, including many system features such as intranode and internode communication among devices. SHOC includes benchmark implementations in both OpenCL and CUDA in order to provide a comparison of these programming models.

Iterative induced dipoles computation for molecular mechanics on GPUs
F. Pratas, R. Mata, L. Sousa
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735708
Abstract: In this work, we present a first step towards the efficient implementation of polarizable molecular mechanics force fields with GPU acceleration. The computational bottleneck of such applications lies in the treatment of electrostatics, where higher-order multipoles and a self-consistent treatment of polarization effects are needed. We have coded these sections, for the case of a non-periodic simulation, with the CUDA programming model. Results show a speedup factor of 21 for a single-precision GPU implementation compared to the serial CPU version. A discussion of the optimization and parameterization steps is included, along with a comparison between different graphics cards and a shared-memory parallel CPU implementation. The current work demonstrates the potential usefulness of GPU programming in accelerating this field of applications.

Cortical architectures on a GPGPU
Andrew Nere, Mikko H. Lipasti
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735693
Abstract: As the number of devices available per chip continues to increase, the computational potential of future computer architectures grows likewise. While this is a clear benefit for future computing devices, future chips will also likely suffer from more faulty devices and increased power consumption, and they will likely be difficult to program if the current trend of adding more parallel cores continues. However, recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility.
In this paper we describe a GPGPU extension to an intelligent model based on the mammalian neocortex. The GPGPU is a readily available architecture that fits well with the parallel cortical architecture inspired by the basic building blocks of the human brain. Using NVIDIA's CUDA framework, we have achieved up to 273x speedup over our unoptimized serial C++ implementation. We also consider two inefficiencies inherent in our initial design: multiple kernel-launch overhead and poor utilization of GPGPU resources. We propose a software work-queue structure to address the former, and pipelining of the cortical architecture during the training phase for the latter. Additionally, building on our success in extending the model to the GPU, we speculate on the hardware requirements for simulating the computational abilities of mammalian brains.

GPGPU role within a 500 TFLOPS heterogeneous cluster
R. Linderman
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735700
Abstract: The outstanding price-performance of GPGPU technology has made it a key architectural engine within a 500 TFLOPS heterogeneous cluster being assembled by the Air Force Research Laboratory in Rome, NY. This new machine will likely be the largest interactive HPC system in the world and feature $4/GFLOPS overall system price-performance and 1.5 TFLOPS/kW power efficiency. The heterogeneous aspect of the cluster reflects a combination of roughly 300 TFLOPS of performance from 2000 PS3 gaming consoles plus 200 TFLOPS from GPGPUs closely coupled to 84 headnodes of the subclusters within the overall machine.
The blend of GPGPUs, Cell processors within the PS3s, and Xeon processors in the headnodes is a deliberate mix intended to offer alternative programming environments suited to different applications, or to different portions of an application. The large DRAM memory and local disk capacity of the multicore Xeon headnodes provide a familiar environment for handling a wide swath of application codes with a popular computing environment, while for segments of applications requiring higher performance, the Cell and GPGPU architectures are available for acceleration based on large-scale parallelization.
This talk will discuss programming experiences to date on the GPGPUs, Cells, and Xeons, and the attributes of algorithms that favor each of these aspects of the heterogeneous machine.

Accelerating MATLAB Image Processing Toolbox functions on GPUs
J. Kong, Martin Dimitrov, Yi Yang, J. Liyanage, Lin Cao, Jacob Staples, Mike Mantor, Huiyang Zhou
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735703
Abstract: In this paper, we present our effort in developing an open-source GPU (graphics processing unit) code library for the MATLAB Image Processing Toolbox (IPT). We ported a dozen representative functions from the IPT and, based on their inherent characteristics, grouped them into four categories: data independent, data sharing, algorithm dependent, and data dependent. For each category, we present a detailed case study that reveals interesting insights into how to efficiently optimize the code for GPUs and highlights performance-critical hardware features, some of which have not been well explored in the existing literature. Our results show drastic speedups for the functions in the data-independent and data-sharing categories by leveraging hardware support judiciously, and moderate speedups for those in the algorithm-dependent category through careful algorithm selection and parallelization. For the functions in the last category, fine-grain synchronization and data-dependency requirements are the main obstacles to an efficient GPU implementation.

Parallel multiclass classification using SVMs on GPUs
Sergio Herrero-Lopez, John R. Williams, Abel Sanchez
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735692
Abstract: The scaling of serial algorithms can no longer rely on improvements in CPUs. The performance of classical Support Vector Machine (SVM) implementations has reached its limit, and the arrival of the multicore era requires these algorithms to adapt to a new parallel scenario. Graphics Processing Units (GPUs) have arisen as high-performance platforms for implementing data-parallel algorithms. This paper describes how a naïve implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput. Empirical results show that the training and classification times of the algorithm can be reduced by an order of magnitude compared to a classical multiclass solver, LIBSVM, while guaranteeing the same accuracy.

Compiling Python to a hybrid execution environment
R. Garg, J. N. Amaral
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735695
Abstract: A new compilation framework enables the execution of numerically intensive applications, written in Python, on a hybrid execution environment formed by a CPU and a GPU. The compiler automatically computes the set of memory locations that need to be transferred to the GPU and produces the correct mapping between the CPU and GPU address spaces; the programming model thus implements a virtual shared address space. The framework is implemented as a combination of unPython, an ahead-of-time compiler from Python/NumPy to the C programming language, and jit4GPU, a just-in-time compiler from C to the AMD CAL interface. Experimental evaluation demonstrates that for some benchmarks the generated GPU code is 50 times faster than the generated OpenMP code. The GPU performance also compares favorably with optimized CPU BLAS code for single-precision computations in most cases.

Best-effort semantic document search on GPUs
S. Byna, Jiayuan Meng, A. Raghunathan, S. Chakradhar, S. Cadambi
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735705
Abstract: Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation on an Intel Xeon 2.4GHz processor. We ascribe the less-than-spectacular speedup to a mismatch between the workload characteristics of semantic indexing and the unique architectural features of GPUs. Compared to the regular numerical computations that have been ported to GPUs with great success, our semantic indexing algorithm (the recently proposed Supervised Semantic Indexing algorithm, SSI) has interesting characteristics: the amount of parallelism in each training instance is data-dependent, and each iteration involves the product of a dense matrix with a sparse vector, resulting in random memory access patterns. As a result, the baseline GPU implementation significantly under-utilizes the hardware resources (processing elements and memory bandwidth) of the GPU platform. However, the SSI algorithm also demonstrates unique characteristics, which we collectively refer to as the "forgiving nature" of the algorithm. These characteristics allow for novel optimizations that do not strive to preserve numerical equivalence of each training iteration with the sequential implementation. In particular, we consider best-effort computing techniques, such as dependency relaxation and computation dropping, to suitably alter the workload characteristics of SSI and leverage the unique architectural features of the GPU. We also show that the realization of dependency relaxation and computation dropping on a GPU is quite different from how one would implement these concepts on a multicore CPU, largely due to the distinct architectural features of the GPU. Our new techniques dramatically enhance the amount of parallel workload, leading to much higher performance on the GPU. By optimizing data transfers between the CPU and GPU, and by reducing GPU kernel invocation overheads, we achieve further performance gains. We evaluated our GPU-accelerated implementation of semantic document search on a database of over 1.8 million documents from Wikipedia. By applying our performance-enhancing strategies, our GPU implementation on a 128-core Tesla C870 achieved a 5.5X acceleration over a baseline parallel implementation on the same GPU. Compared to a baseline parallel TBB implementation on a dual-socket quad-core Intel Xeon multicore CPU (8 cores), the enhanced GPU implementation is 11X faster. Compared to a parallel implementation on the same multicore CPU that also uses dependency relaxation and computation dropping, our enhanced GPU implementation is 5X faster.

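As a rough sketch of the dense-matrix x sparse-vector product at the core of SSI, with one simple reading of "computation dropping" (skipping sparse-vector entries below a magnitude threshold), consider the hypothetical CUDA kernel below; the actual best-effort policies in the paper may differ, and all names and the thresholding rule are assumptions.

    // Hypothetical reading of "computation dropping" in the SSI inner product
    // y = W * x, with W dense and x sparse: sparse entries below a magnitude
    // threshold are skipped, trading exactness for a more regular workload.
    __global__ void dense_sparse_mv(const float* W,     // rows x cols, row-major dense matrix
                                    const int*   x_idx, // column indices of nonzeros in x
                                    const float* x_val, // values of nonzeros in x
                                    int nnz, float* y, int rows, int cols,
                                    float drop_threshold) {
        int r = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output row
        if (r >= rows) return;
        float acc = 0.0f;
        for (int k = 0; k < nnz; ++k) {
            float v = x_val[k];
            if (fabsf(v) < drop_threshold) continue;    // best-effort: drop small terms
            acc += W[r * cols + x_idx[k]] * v;          // gather from the dense row
        }
        y[r] = acc;
    }

Dependency relaxation would instead show up in how training iterations are scheduled (allowing updates to proceed on slightly stale vectors) rather than inside a single kernel.
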
Modeling GPU-CPU workloads and systems
Andrew Kerr, G. Diamos, S. Yalamanchili
GPGPU-3, 2010-03-14. DOI: 10.1145/1735688.1735696
Abstract: Heterogeneous systems, systems with multiple processors tailored for specialized tasks, are challenging programming environments. While it may be possible for domain experts to optimize a high-performance application for a very specific and well-documented system, it may not perform as well, or even function, on a different system. Developers who have less experience with either the application domain or the system architecture may devote significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling framework is necessary to ease application development and automate program optimization on heterogeneous platforms.
This paper reports on an empirical evaluation of 25 CUDA applications on four GPUs and three CPUs, leveraging the Ocelot dynamic compiler infrastructure, which can execute and instrument the same CUDA applications on either target. Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous processors. These relationships are then fed into a modeling framework that attempts to predict the performance of similar classes of applications on different processors. Most significantly, this study identifies several non-intuitive relationships among program characteristics and demonstrates that it is possible to accurately model CUDA kernel performance using only metrics that are available before a kernel is executed.