{"title":"A Taxonomy of GPGPU Performance Scaling","authors":"Abhinandan Majumdar, Gene Y. Wu, K. Dev, J. Greathouse, Indrani Paul, Wei Huang, Arjun Venugopal, Leonardo Piga, Chip Freitag, Sooraj Puthoor","doi":"10.1109/IISWC.2015.22","DOIUrl":"https://doi.org/10.1109/IISWC.2015.22","url":null,"abstract":"Graphics processing units (GPUs) range from small, embedded designs to large, high-powered discrete cards. While the performance of graphics workloads is generally understood, there has been little study of the performance of GPGPU applications across a variety of hardware configurations. This work presents performance scaling data gathered for 267 GPGPU kernels from 97 programs run on 891 hardware configurations of a modern GPU. We study the performance of these kernels across a 5× change in core frequency, 8.3× change in memory bandwidth, and 11× difference in compute units. We illustrate that many kernels scale in intuitive ways, such as those that scale directly with added computational capabilities or memory bandwidth. We also find a number of kernels that scale in non-obvious ways, such as losing performance when more processing units are added or plateauing as frequency and bandwidth are increased. 
In addition, we show that a number of current benchmark suites do not scale to modern GPU sizes, implying that either new benchmarks or new inputs are warranted.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129295224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Source Mark: A Source-Level Approach for Identifying Architecture and Optimization Agnostic Regions for Performance Analysis","authors":"Abhinav Agrawal, Bagus Wibowo, James Tuck","doi":"10.1109/IISWC.2015.27","DOIUrl":"https://doi.org/10.1109/IISWC.2015.27","url":null,"abstract":"Computer architects often evaluate performance on only parts of a program and not the entire program due to long simulation times that could take weeks or longer to finish. However, choosing regions of a program to evaluate in a way that is consistent and correct with respect to different compilers and different architectures is very challenging and has not received sufficient attention. The need for such tools is growing in importance given the diversity of architectures and compilers in use today. In this work, we propose a technique that identifies regions of a desired granularity for performance evaluation. We use a source-to-source compiler that inserts software marks into the program's source code to divide the execution into regions with a desired dynamic instruction count. An evaluation framework chooses from among a set of candidate marks to find ones that are both consistent across different architectures or compilers and can yield a low run-time instruction overhead. Evaluated on a set of SPEC applications, with a region size of about 100 million instructions, our technique has a dynamic instruction overhead as high as 3.3% with an average overhead of 0.47%. We also demonstrate the scalability of our technique by evaluating the dynamic instruction overhead for regions of finer granularity and show similarly small overheads. Of the applications we studied, we were unable to find suitable fine-grained regions only for 462.libquantum and 444.namd. Our technique is an effective alternative to traditional binary-level approaches. 
We have demonstrated that a source-level approach is robust, that it can achieve low overhead, and that it reduces the effort for bringing up new architectures or compilers into an existing evaluation framework.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133021973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Retrospective Look Back on the Road Towards Energy Proportionality","authors":"Daniel Wong, Julia Chen, M. Annavaram","doi":"10.1109/IISWC.2015.18","DOIUrl":"https://doi.org/10.1109/IISWC.2015.18","url":null,"abstract":"In this paper, we take a retrospective look at the road taken towards improving energy proportionality (EP), in order to find out where we are currently and how we got here. Through statistical regression of published SPEC power results, we were able to identify and quantify the sources of past EP improvements.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123473870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors","authors":"Joel Hestness, S. Keckler, D. Wood","doi":"10.1109/IISWC.2015.15","DOIUrl":"https://doi.org/10.1109/IISWC.2015.15","url":null,"abstract":"Emerging heterogeneous CPU-GPU processors have introduced unified memory spaces and cache coherence. CPU and GPU cores will be able to concurrently access the same memories, eliminating memory copy overheads and potentially changing the application-level optimization targets. To date, little is known about how developers may organize new applications to leverage the available, finer-grained communication in these processors. However, understanding potential application optimizations and adaptations is critical for directing heterogeneous processor programming model and architectural development. This paper quantifies opportunities for applications and architectures to evolve to leverage the new capabilities of heterogeneous processors. To identify these opportunities, we ported and simulated a broad set of benchmarks originally developed for discrete GPUs to remove memory copies, and applied analytical models to quantify their application-level pipeline inefficiencies. For existing benchmarks, GPU bulk-synchronous software pipelines result in considerable core and cache utilization inefficiency. 
For heterogeneous processors, the results indicate increased opportunity for techniques that provide flexible compute and data granularities, and support for efficient producer-consumer data handling and synchronization within caches.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"447 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122486526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Computational GPU Design with GT-Pin","authors":"Melanie Kambadur, Sunpyo Hong, Juan Cabral, H. Patil, C. Luk, S. Sajid, Martha A. Kim","doi":"10.1109/IISWC.2015.14","DOIUrl":"https://doi.org/10.1109/IISWC.2015.14","url":null,"abstract":"As computational applications become common for graphics processing units, new hardware designs must be developed to meet the unique needs of these workloads. Performance simulation is an important step in appraising how well a candidate design will serve these needs, but unfortunately, computational GPU programs are so large that simulating them in detail is prohibitively slow. This work addresses the need to understand very large computational GPU programs in three ways. First, it introduces a fast tracing tool that uses binary instrumentation for in-depth analyses of native executions on existing architectures. Second, it characterizes 25 commercial and benchmark OpenCL applications, which average 308 billion GPU instructions apiece and are by far the largest benchmarks that have been natively profiled at this level of detail. Third, it accelerates simulation of future hardware by pinpointing small subsets of OpenCL applications that can be simulated as representative surrogates in lieu of full-length programs. 
Our fast selection method requires no simulation itself and allows the user to navigate the accuracy/simulation speed trade-off space, from extremely accurate with reasonable speedups (35X increase in simulation speed for 0.3% error) to reasonably accurate with extreme speedups (223X simulation speedup for 3.0% error).","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115614260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server","authors":"S. Beamer, K. Asanović, D. Patterson","doi":"10.1109/IISWC.2015.12","DOIUrl":"https://doi.org/10.1109/IISWC.2015.12","url":null,"abstract":"Graph processing is an increasingly important application domain and is typically communication-bound. In this work, we analyze the performance characteristics of three high-performance graph algorithm codebases using hardware performance counters on a conventional dual-socket server. Unlike many other communication-bound workloads, graph algorithms struggle to fully utilize the platform's memory bandwidth and so increasing memory bandwidth utilization could be just as effective as decreasing communication. Based on our observations of simultaneous low compute and bandwidth utilization, we find there is substantial room for a different processor architecture to improve performance without requiring a new memory system.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130085004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Performance Trade-offs on Energy-Constrained Devices with Multi-component DVFS","authors":"R. Begum, David Werner, Mark Hempstead, Guru Prasad Srinivasa, Geoffrey Challen","doi":"10.1109/IISWC.2015.10","DOIUrl":"https://doi.org/10.1109/IISWC.2015.10","url":null,"abstract":"Battery lifetime continues to be a top complaint about smart phones. Dynamic voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some time, and provides a trade-off between energy and performance. Dynamic frequency scaling is beginning to be applied to memory as well to make more energy-performance tradeoffs possible. We present the first characterization of the behavior of the optimal frequency settings of workloads running both under energy constraints and on systems capable of CPU DVFS and memory DFS, an environment representative of next-generation mobile devices. Our results show that continuously using the optimal frequency settings results in a large number of frequency transitions, which end up hurting performance. However, by permitting a small loss in performance, transition overhead can be reduced and end-to-end performance and energy consumption improved. We introduce the idea of inefficiency as a way of constraining task energy consumption relative to the most energy-efficient settings, and characterize the performance of multiple workloads running under different inefficiency settings. 
Overall our results have multiple implications for next-generation mobile devices exposing multiple energy-performance tradeoffs.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133156916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterization and Throttling-Based Mitigation of Memory Interference for Heterogeneous Smartphones","authors":"Davesh Shingari, A. Arunkumar, Carole-Jean Wu","doi":"10.1109/IISWC.2015.9","DOIUrl":"https://doi.org/10.1109/IISWC.2015.9","url":null,"abstract":"The availability of a wide range of general purpose as well as accelerator cores on modern smart phones means that a significant number of applications can be executed on a smart phone simultaneously, resulting in an ever increasing demand on the memory subsystem. While the increased computation capability is intended for improving user experience, memory requests from each concurrent application exhibit unique memory access patterns as well as specific timing constraints. If not considered, this could lead to significant memory contention and result in lowered user experience. In this paper, we design experiments to analyze the performance degradation caused by the interference at the memory subsystem for a broad range of commonly-used smart phone applications. The characterization studies are performed on a real smart phone device -- Google Nexus5 -- running an Android operating system. Our results show that user-centric smart phone applications, such as web browsing and media player, suffer up to 34% and 21% performance degradation, respectively, from shared resource contention at the application processor's last-level cache, the communication fabric, and the main memory. Taking a step further, we demonstrate the feasibility and effectiveness of a frequency throttling-based memory interference mitigation technique. 
At the expense of performance degradation of interfering applications, frequency throttling is an effective technique for mitigating memory interference, leading to better QoS and user experience, for user-centric applications.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121222253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Big or Little: A Study of Mobile Interactive Applications on an Asymmetric Multi-core Platform","authors":"Wonik Seo, Daegil Im, Jeongim Choi, Jaehyuk Huh","doi":"10.1109/IISWC.2015.7","DOIUrl":"https://doi.org/10.1109/IISWC.2015.7","url":null,"abstract":"This paper characterizes a commercial mobile platform on an asymmetric multi-core processor, investigating its available thread-level parallelism (TLP) and the impact of core asymmetry on applications. This paper explores three critical aspects of asymmetric mobile systems: the asymmetric hardware platform, application behavior, and the impact of scheduling and power management. First, this paper presents the performance and energy characteristics of a commercial asymmetric multi-core architecture with two core types. The comparison between big and little cores shows the potential benefit of asymmetric multi-cores for improving energy efficiency. Second, the paper investigates the available thread-level parallelism and core utilization behaviors of mobile interactive applications. Using popular mobile applications for the Android system, this paper analyzes the distinct TLP and CPU usage patterns of interactive applications. Third, the paper explores the impact of power governor and CPU scheduler on the asymmetric system. Multiple cores with heterogeneous core types complicate scheduling and frequency scaling schemes, since the scheduler must migrate threads to different core types, in addition to traditional load balancing. 
This study shows that the current mobile applications are not fully utilizing the asymmetric multi-cores due to the lack of TLP and low computational requirement for big cores.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121307642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PC Design, Use, and Purchase Relations","authors":"Al M. Rashid, B. Kuhn, B. Arbab, D. Kuck","doi":"10.1109/IISWC.2015.25","DOIUrl":"https://doi.org/10.1109/IISWC.2015.25","url":null,"abstract":"For 25 years, industry standard benchmarks have proliferated, attempting to approximate user activities. This has helped drive the success of PCs to commodity levels by characterizing apps for designers and offering performance information for users. However, the many new configurations of each PC release cycle often leave users unsure about how to choose one. This paper takes a different approach, with tools based on new metrics to analyze real usage by millions of people. Our goal is to develop a methodology for deeper understanding of usage that can help designers satisfy users. These metrics demonstrate that usages are uniformly different between high- and low-end CPU-based systems, regardless of why a user bought a given system. We outline how this data can be used to partition markets and make more effective hardware (hw) and software (sw) design decisions tailoring systems for prospective markets.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"66 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129998033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}