2015 IEEE International Symposium on Workload Characterization: Latest Papers

A Taxonomy of GPGPU Performance Scaling

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.22

Abhinandan Majumdar, Gene Y. Wu, K. Dev, J. Greathouse, Indrani Paul, Wei Huang, Arjun Venugopal, Leonardo Piga, Chip Freitag, Sooraj Puthoor

Abstract: Graphics processing units (GPUs) range from small, embedded designs to large, high-powered discrete cards. While the performance of graphics workloads is generally understood, there has been little study of the performance of GPGPU applications across a variety of hardware configurations. This work presents performance scaling data gathered for 267 GPGPU kernels from 97 programs run on 891 hardware configurations of a modern GPU. We study the performance of these kernels across a 5× change in core frequency, 8.3× change in memory bandwidth, and 11× difference in compute units. We illustrate that many kernels scale in intuitive ways, such as those that scale directly with added computational capabilities or memory bandwidth. We also find a number of kernels that scale in non-obvious ways, such as losing performance when more processing units are added or plateauing as frequency and bandwidth are increased. In addition, we show that a number of current benchmark suites do not scale to modern GPU sizes, implying that either new benchmarks or new inputs are warranted.

Citations: 13

Source Mark: A Source-Level Approach for Identifying Architecture and Optimization Agnostic Regions for Performance Analysis

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.27

Abhinav Agrawal, Bagus Wibowo, James Tuck

Abstract: Computer architects often evaluate performance on only parts of a program and not the entire program due to long simulation times that could take weeks or longer to finish. However, choosing regions of a program to evaluate in a way that is consistent and correct with respect to different compilers and different architectures is very challenging and has not received sufficient attention. The need for such tools is growing in importance given the diversity of architectures and compilers in use today. In this work, we propose a technique that identifies regions of a desired granularity for performance evaluation. We use a source-to-source compiler that inserts software marks into the program's source code to divide the execution into regions with a desired dynamic instruction count. An evaluation framework chooses from among a set of candidate marks to find ones that are both consistent across different architectures or compilers and can yield a low run-time instruction overhead. Evaluated on a set of SPEC applications, with a region size of about 100 million instructions, our technique has a dynamic instruction overhead as high as 3.3% with an average overhead of 0.47%. We also demonstrate the scalability of our technique by evaluating the dynamic instruction overhead for regions of finer granularity and show similarly small overheads. Of the applications we studied, we were unable to find suitable fine-grained regions only for 462.libquantum and 444.namd. Our technique is an effective alternative to traditional binary-level approaches. We have demonstrated that a source-level approach is robust, that it can achieve low overhead, and that it reduces the effort for bringing up new architectures or compilers into an existing evaluation framework.

Citations: 0

A Retrospective Look Back on the Road Towards Energy Proportionality

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.18

Daniel Wong, Julia Chen, M. Annavaram

Citations: 3

GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.15

Joel Hestness, S. Keckler, D. Wood

Abstract: Emerging heterogeneous CPU-GPU processors have introduced unified memory spaces and cache coherence. CPU and GPU cores will be able to concurrently access the same memories, eliminating memory copy overheads and potentially changing the application-level optimization targets. To date, little is known about how developers may organize new applications to leverage the available, finer-grained communication in these processors. However, understanding potential application optimizations and adaptations is critical for directing heterogeneous processor programming model and architectural development. This paper quantifies opportunities for applications and architectures to evolve to leverage the new capabilities of heterogeneous processors. To identify these opportunities, we ported and simulated a broad set of benchmarks originally developed for discrete GPUs to remove memory copies, and applied analytical models to quantify their application-level pipeline inefficiencies. For existing benchmarks, GPU bulk-synchronous software pipelines result in considerable core and cache utilization inefficiency. For heterogeneous processors, the results indicate increased opportunity for techniques that provide flexible compute and data granularities, and support for efficient producer-consumer data handling and synchronization within caches.

Citations: 35

Fast Computational GPU Design with GT-Pin

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.14

Melanie Kambadur, Sunpyo Hong, Juan Cabral, H. Patil, C. Luk, S. Sajid, Martha A. Kim

Abstract: As computational applications become common for graphics processing units, new hardware designs must be developed to meet the unique needs of these workloads. Performance simulation is an important step in appraising how well a candidate design will serve these needs, but unfortunately, computational GPU programs are so large that simulating them in detail is prohibitively slow. This work addresses the need to understand very large computational GPU programs in three ways. First, it introduces a fast tracing tool that uses binary instrumentation for in-depth analyses of native executions on existing architectures. Second, it characterizes 25 commercial and benchmark OpenCL applications, which average 308 billion GPU instructions apiece and are by far the largest benchmarks that have been natively profiled at this level of detail. Third, it accelerates simulation of future hardware by pinpointing small subsets of OpenCL applications that can be simulated as representative surrogates in lieu of full-length programs. Our fast selection method requires no simulation itself and allows the user to navigate the accuracy/simulation speed trade-off space, from extremely accurate with reasonable speedups (35X increase in simulation speed for 0.3% error) to reasonably accurate with extreme speedups (223X simulation speedup for 3.0% error).

Citations: 23

Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.12

S. Beamer, K. Asanović, D. Patterson

Citations: 148

Energy-Performance Trade-offs on Energy-Constrained Devices with Multi-component DVFS

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.10

R. Begum, David Werner, Mark Hempstead, Guru Prasad Srinivasa, Geoffrey Challen

Abstract: Battery lifetime continues to be a top complaint about smart phones. Dynamic voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some time, and provides a trade-off between energy and performance. Dynamic frequency scaling (DFS) is beginning to be applied to memory as well, making more energy-performance trade-offs possible. We present the first characterization of the behavior of the optimal frequency settings of workloads running both under energy constraints and on systems capable of CPU DVFS and memory DFS, an environment representative of next-generation mobile devices. Our results show that continuously using the optimal frequency settings results in a large number of frequency transitions, which end up hurting performance. However, by permitting a small loss in performance, transition overhead can be reduced and end-to-end performance and energy consumption improved. We introduce the idea of inefficiency as a way of constraining task energy consumption relative to the most energy-efficient settings, and characterize the performance of multiple workloads running under different inefficiency settings. Overall, our results have multiple implications for next-generation mobile devices exposing multiple energy-performance trade-offs.

Citations: 32

Characterization and Throttling-Based Mitigation of Memory Interference for Heterogeneous Smartphones

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.9

Davesh Shingari, A. Arunkumar, Carole-Jean Wu

Abstract: The availability of a wide range of general purpose as well as accelerator cores on modern smart phones means that a significant number of applications can be executed on a smart phone simultaneously, resulting in an ever increasing demand on the memory subsystem. While the increased computation capability is intended for improving user experience, memory requests from each concurrent application exhibit unique memory access patterns as well as specific timing constraints. If not considered, this could lead to significant memory contention and result in a degraded user experience. In this paper, we design experiments to analyze the performance degradation caused by interference at the memory subsystem for a broad range of commonly-used smart phone applications. The characterization studies are performed on a real smart phone device (Google Nexus5) running an Android operating system. Our results show that user-centric smart phone applications, such as web browsing and media player, suffer up to 34% and 21% performance degradation, respectively, from shared resource contention at the application processor's last-level cache, the communication fabric, and the main memory. Taking a step further, we demonstrate the feasibility and effectiveness of a frequency throttling-based memory interference mitigation technique. At the expense of performance degradation of interfering applications, frequency throttling is an effective technique for mitigating memory interference, leading to better QoS and user experience for user-centric applications.

Citations: 19

Big or Little: A Study of Mobile Interactive Applications on an Asymmetric Multi-core Platform

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.7

Wonik Seo, Daegil Im, Jeongim Choi, Jaehyuk Huh

Abstract: This paper characterizes a commercial mobile platform on an asymmetric multi-core processor, investigating its available thread-level parallelism (TLP) and the impact of core asymmetry on applications. This paper explores three critical aspects of asymmetric mobile systems: the asymmetric hardware platform, application behavior, and the impact of scheduling and power management. First, this paper presents the performance and energy characteristics of a commercial asymmetric multi-core architecture with two core types. The comparison between big and little cores shows the potential benefit of asymmetric multi-cores for improving energy efficiency. Second, the paper investigates the available thread-level parallelism and core utilization behaviors of mobile interactive applications. Using popular mobile applications for the Android system, this paper analyzes the distinct TLP and CPU usage patterns of interactive applications. Third, the paper explores the impact of the power governor and CPU scheduler on the asymmetric system. Multiple cores with heterogeneous core types complicate scheduling and frequency scaling schemes, since the scheduler must migrate threads to different core types, in addition to traditional load balancing. This study shows that current mobile applications are not fully utilizing the asymmetric multi-cores due to the lack of TLP and low computational requirement for big cores.

Citations: 25

PC Design, Use, and Purchase Relations

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.25

Al M. Rashid, B. Kuhn, B. Arbab, D. Kuck

Citations: 2