2015 IEEE International Symposium on Workload Characterization: Latest Papers (Page 2)

3D Workload Subsetting for GPU Architecture Pathfinding

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.24

V. George

Abstract: Growth of high-end 3D gaming, expansion of gaming to new devices such as tablets and phones, and the evolution of multiple graphics APIs such as Direct3D 10+ and OpenGL 3.0+ have led to an explosion in the number of workloads that need to be evaluated for GPU architecture path-finding. To decide on the optimal architecture configuration, the workloads need to be simulated on a wide range of architecture designs, which incurs a huge cost in both time and resources. To reduce the simulation cost of path-finding, extracting workload subsets from 3D workloads is essential. This paper presents a methodology to find representative workload subsets from 3D workloads by combining clustering and phase detection. In the first part, the paper presents a methodology to group draw-calls based on performance similarity by clustering on their microarchitecture-independent characteristics. Across 717 frames encompassing 828K draw-calls, the clustering solution obtained an average performance prediction error per frame of 1.0% at an average clustering efficiency of 65.8%. The clustering quality is additionally evaluated by counting cluster outliers, which are clusters with an intra-cluster prediction error greater than 20%. The clustering quality, measured using cluster outliers, indicates the performance similarity within the individual clusters. Across the spectrum of frames, on average only 3.0% of the clusters are outliers, which indicates high clustering quality. To detect repetitive behavior in 3D workloads, we propose characterizing frame intervals using shader vectors and then using shader vector equality to extract the repeating patterns. We show that phases exist in each game in the BioShock series, enabling extraction of small representative subsets from the workloads. The performance improvement of the workload subsets, which are less than one percent of the parent workload, under GPU frequency scaling is highly correlated (correlation coefficient of 99.7% or higher) with the performance improvement of the parent workload.
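
The phase-detection step in this abstract (characterize frame intervals with shader vectors, then compare the vectors for equality) can be illustrated with a minimal Python sketch. The abstract does not define a shader vector, so the sketch assumes it is simply the sorted set of shader IDs used within an interval; the function names, the fixed interval length and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def shader_vector(interval):
    """Assumed characterization: the sorted set of shader IDs used by
    all frames in the interval (not the paper's exact definition)."""
    return tuple(sorted({shader for frame in interval for shader in frame}))


def find_phases(frames, interval_len):
    """Group fixed-length frame intervals whose shader vectors are equal.

    Returns a dict mapping each shader vector to the start indices of
    the intervals that share it.
    """
    phases = defaultdict(list)
    for start in range(0, len(frames) - interval_len + 1, interval_len):
        vec = shader_vector(frames[start:start + interval_len])
        phases[vec].append(start)
    return dict(phases)


def representative_subset(phases):
    """Keep only the first interval of every phase as its representative."""
    return sorted(starts[0] for starts in phases.values())


# Hypothetical trace: each frame is the list of shader IDs it uses.
frames = [
    ["vs_a", "ps_a"], ["vs_a", "ps_a"],
    ["vs_b", "ps_b"], ["vs_b", "ps_c"],
    ["ps_a", "vs_a"], ["vs_a", "ps_a"],
]
phases = find_phases(frames, interval_len=2)
subset = representative_subset(phases)
print(subset)
```

In this toy trace the first and third intervals use the same shaders, so only one of them needs to be simulated; a real trace would also need a policy for choosing the interval length, which the abstract does not specify.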

Citations: 3

On Power-Performance Characterization of Concurrent Throughput Kernels

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.17

Nilanjan Goswami, Yuhai Li, Amer Qouneh, Chao Li, Tao Li

Citations: 0

Revealing Critical Loads and Hidden Data Locality in GPGPU Applications

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.23

Gunjae Koo, Hyeran Jeon, M. Annavaram

Abstract: In graphics processing units (GPUs), memory access latency is one of the most critical performance hurdles. Several warp schedulers and memory prefetching algorithms have been proposed to hide the long memory access latency. Prior application characterization studies shed light on the interaction between applications, GPU microarchitecture and memory subsystem behavior. Most of these studies, however, only present aggregate statistics on how the memory system behaves over the entire application run. In particular, they do not consider how individual load instructions in a program contribute to the aggregate memory system behavior. The analysis presented in this paper shows that there are two distinct classes of load instructions, categorized as deterministic and non-deterministic loads. Using a combination of profiling data from a real GPU card and cycle-accurate simulation data, we show that the two types of loads differ significantly in their performance impact. We discuss and suggest several approaches to treating these two load categories differently within the GPU microarchitecture to optimize memory system performance.

Citations: 17

Characterization of Shared Library Access Patterns of Android Applications

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.19

Xiaowan Dong, S. Dwarkadas, A. Cox

Citations: 4

Characterizing Data Analytics Workloads on Intel Xeon Phi

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.20

Biwei Xie, Xu Liu, Jianfeng Zhan, Zhen Jia, Yuqing Zhu, Lei Wang, Lixin Zhang

Abstract: With the growing computation demands of data analytics, heterogeneous architectures have become popular for their support of high parallelism. Intel Xeon Phi, a many-core coprocessor originally designed for high-performance computing applications, is promising for data analytics workloads. However, to the best of our knowledge, there is no prior work systematically characterizing the performance of data analytics workloads on Xeon Phi. It is difficult to design a benchmark suite that represents the behavior of data analytics workloads on Xeon Phi. The main challenge lies in fully exploiting Xeon Phi's features, such as long SIMD instructions, simultaneous multithreading, and a complex memory hierarchy. To address this issue, we develop BigDataBench-Phi, which consists of seven representative data analytics workloads. All of these benchmarks are optimized for Xeon Phi and are able to characterize Xeon Phi's support for data analytics workloads. Compared with a 24-core Xeon E5-2620 machine, BigDataBench-Phi achieves reasonable speedups for most of its benchmarks, ranging from 1.5X to 23.4X. Our experiments show that workloads operating on high-dimensional matrices can benefit significantly from instruction- and thread-level parallelism on Xeon Phi.

Citations: 6

Differential Fault Injection on Microarchitectural Simulators

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.28

Manolis Kaliorakis, Sotiris Tselonis, Athanasios Chatzidimitriou, N. Foutris, D. Gizopoulos

Abstract: Fault injection on microarchitectural structures modeled in performance simulators is an effective method for assessing microprocessor reliability in early design stages. Compared to lower-level fault injection approaches, it is orders of magnitude faster and allows execution of large portions of workloads to study the effect of faults on the final program output. Moreover, for many important hardware components it delivers accurate reliability estimates compared to analytical methods, which are fast but are known to significantly overestimate a structure's vulnerability to faults. This paper investigates the effectiveness of microarchitectural fault injection for x86 and ARM microprocessors in a differential way: by developing and comparing two fault injection frameworks on top of the most popular performance simulators, MARSS and Gem5. The injectors, called MaFIN and GeFIN (for MARSS-based and Gem5-based Fault Injector, respectively), are designed for accurate reliability studies and deliver several contributions, among which are: (a) reliability studies for a wide set of fault models on major hardware structures (for different sizes and organizations), (b) a study of the reliability sensitivity of microarchitectural structures for the same ISA (x86) implemented on two different simulators, and (c) a study of the reliability of workloads and microarchitectures for the two most popular ISAs (ARM vs. x86). For the workloads of our experimental study, we analyze the common trends observed in the CPU reliability assessments produced by the two injectors. We also explain the sources of difference when the tools provide diverging reliability reports. Both the common trends and the differences are attributed to fundamental implementation choices of the simulators and are supported by the benchmarks' runtime statistics. The insights of our analysis can guide the selection of the most appropriate tool for hardware reliability studies (and thus decision-making for protection mechanisms) on certain microarchitectures for the popular x86 and ARM ISAs.
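
As a rough illustration of what a fault-injection campaign measures, the following Python sketch flips a single random bit in a toy register file and classifies each run by comparing its output with a fault-free ("golden") run. It is not MaFIN or GeFIN and does not use MARSS or Gem5; the toy workload, the 16-bit register width, the two outcome classes and all names are assumptions made purely for illustration.

```python
import random


def run_workload(regs):
    """Toy workload: only the low 8 bits of each register affect the output,
    so faults in the upper bits are masked."""
    return sum(r & 0xFF for r in regs)


def inject_bit_flip(regs, index, bit):
    """Return a copy of the register file with one bit flipped
    (a single-bit transient fault model)."""
    faulty = list(regs)
    faulty[index] ^= 1 << bit
    return faulty


def campaign(regs, width=16, trials=1000, seed=0):
    """Inject one random bit flip per trial and classify the outcome as
    'masked' (output unchanged) or 'sdc' (silent data corruption)."""
    rng = random.Random(seed)
    golden = run_workload(regs)
    outcomes = {"masked": 0, "sdc": 0}
    for _ in range(trials):
        faulty = inject_bit_flip(
            regs, rng.randrange(len(regs)), rng.randrange(width)
        )
        key = "masked" if run_workload(faulty) == golden else "sdc"
        outcomes[key] += 1
    return outcomes


outcomes = campaign([3, 200, 17, 42])
print(outcomes)
```

The fraction of non-masked outcomes is the kind of per-structure vulnerability figure that the abstract contrasts with analytical estimates; a simulator-based injector would additionally distinguish crashes, timeouts and other outcome classes.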

Citations: 64

Performance Characterization of High-Level Programming Models for GPU Graph Analytics

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.13

Yuduo Wu, Yangzihao Wang, Yuechao Pan, Carl Yang, John Douglas Owens

Citations: 23

Power Aware NUMA Scheduler in VMware's ESXi Hypervisor

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.30

Qasim Ali, Haoqiang Zheng, Tim Mann, Raghunathan Srinivasan

Abstract: Virtualized platforms have emerged as the top solution for cloud computing, especially in today's power-constrained data centers. Virtualization helps save power and energy by allowing physical machines to be replaced by virtual machines (VMs) and then consolidated onto a smaller number of physical hosts. The number of physical hosts that are powered on can even be varied dynamically, as with VMware's Distributed Power Management (DPM) feature. At a lower level, it remains valuable to manage power usage within each individual host, and typical systems, including VMware's ESXi hypervisor, do so by adjusting each processor's P-states (frequency and voltage states) and C-states (idle states) according to the demands of the current workload. With current NUMA systems, however, an intermediate level of power management is possible that has gone largely unexplored. In this paper we propose to optimize the placement of virtual machines on NUMA-enabled systems such that the overall energy consumption of the virtualized system is reduced with minimal impact on VM performance. Our heuristics exploit a relatively new CPU hardware feature called independent package C-states. To the best of our knowledge, this paper presents the first work on making a NUMA scheduler power-aware by exploiting independent package C-states. We implemented a simple heuristic in ESXi and observed power savings of up to 26% and energy efficiency improvements of up to 30% using four realistic workloads and two micro-benchmarks.
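
The abstract does not describe the heuristic implemented in ESXi, so the following Python sketch only illustrates the general idea of power-aware placement: pack VMs onto as few NUMA nodes as possible so that the remaining packages stay idle and could enter a deep package C-state. It uses first-fit-decreasing packing as a stand-in technique; the function name, the vCPU-count capacity model and the sample VMs are hypothetical.

```python
def place_vms(vm_demands, num_nodes, node_capacity):
    """First-fit-decreasing packing of VMs onto NUMA nodes.

    vm_demands maps a VM name to its vCPU demand. Returns the placement
    (VM name -> node index) and the list of nodes left completely idle.
    This is an illustrative stand-in, not the ESXi scheduler's algorithm.
    """
    free = [node_capacity] * num_nodes
    placement = {}
    # Place the largest VMs first so small ones can fill the gaps.
    for vm, demand in sorted(vm_demands.items(), key=lambda kv: -kv[1]):
        for node in range(num_nodes):
            if demand <= free[node]:
                free[node] -= demand
                placement[vm] = node
                break
        else:
            raise ValueError(f"no NUMA node can host {vm}")
    idle_nodes = [n for n in range(num_nodes) if free[n] == node_capacity]
    return placement, idle_nodes


placement, idle = place_vms(
    {"vm1": 4, "vm2": 2, "vm3": 2, "vm4": 3},
    num_nodes=4,
    node_capacity=8,
)
print(placement, idle)
```

A pure packing policy like this ignores memory locality and contention, which is why the abstract stresses reducing energy "with minimal impact on VM performance"; a practical scheduler would have to weigh consolidation against those effects.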

Citations: 6

CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.11

Masab Ahmad, Farrukh Hijaz, Qingchuan Shi, O. Khan

Abstract: Algorithms operating on graphs are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared-memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial and media processing. However, these suites lack graph applications, which must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multithreaded graph algorithms for shared-memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator as well as a real multicore machine setup. CRONO uses both synthetic and real-world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns and incur fine-grained communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.

Citations: 103

Exploring Parallel Programming Models for Heterogeneous Computing Systems

2015 IEEE International Symposium on Workload Characterization Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.16

Mayank Daga, Zachary S. Tschirhart, Chip Freitag

Abstract: Parallel systems that employ CPUs and GPUs as two heterogeneous computational units have become immensely popular due to their ability to maximize performance under restrictive thermal budgets. However, programming heterogeneous systems via traditional programming models such as OpenCL or CUDA involves rewriting large portions of application code. These models also lead to code that is not performance-portable across different architectures or even across different generations of the same architecture. In this paper, we evaluate the current state of two emerging parallel programming models: C++ AMP and OpenACC. These emerging programming paradigms require minimal code changes and rely on compilers to interact with the low-level hardware language, thereby producing performance-portable code from an application standpoint. We analyze the performance and productivity of the emerging programming models and compare them with OpenCL using a diverse set of applications on two different architectures: a CPU coupled with a discrete GPU, and an Accelerated Processing Unit (APU). Our experiments demonstrate that while the emerging programming models improve programmer productivity, they do not yet expose enough flexibility to extract maximum performance compared with traditional programming models.

Citations: 17