SimBench: A portable benchmarking methodology for full-system simulators
Harry Wagstaff, Bruno Bodin, T. Spink, Björn Franke
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975293

Abstract: Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications, simulation performance is paramount. In this paper we argue that existing benchmark suites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited to identifying performance bottlenecks in full-system simulators. While their large, complex workloads indicate how a simulator performs on ‘real-world’ workloads, they give no indication of why a particular simulator runs an application faster or slower than another. We present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a full-system simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memory access performance, I/O and other performance-sensitive areas. SimBench is a cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, targeting the ARM and Intel x86 architectures, we demonstrate that SimBench accurately pinpoints and explains real-world performance anomalies that are largely obfuscated by existing application-oriented benchmarks.
HW/SW co-designed processors: Challenges, design choices and a simulation infrastructure for evaluation
Rakesh Kumar, José Cano, Aleksandar Brankovic, Demos Pavlou, Kyriakos Stavrou, E. Gibert, Alejandro Martínez, Antonio González
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975290

Abstract: Improving single-thread performance is a key challenge in modern microprocessors, especially because the traditional approach of increasing clock frequency and pipeline depth cannot be pushed further due to power constraints. Researchers have therefore been looking at unconventional architectures to boost single-thread performance without running into the power wall. HW/SW co-designed processors, like Nvidia Denver, are emerging as a promising alternative. However, HW/SW co-designed processors need to address key challenges such as startup delay, providing high performance with simple hardware, and translation/optimization overhead before they can become mainstream. A fundamental requirement for evaluating the design choices and trade-offs involved in meeting these challenges is a simulation infrastructure; unfortunately, no such infrastructure is available today. Building one itself poses significant challenges, as it encompasses the complexities of not only an architectural framework but also a compilation framework. This paper identifies the key challenges that HW/SW co-designed processors face and the basic requirements for a simulation infrastructure targeting these architectures. Furthermore, the paper presents DARCO, a simulation infrastructure that enables research in this domain.
{"title":"Evaluating and mitigating bandwidth bottlenecks across the memory hierarchy in GPUs","authors":"Saumay Dublish, V. Nagarajan, N. Topham","doi":"10.1109/ISPASS.2017.7975295","DOIUrl":"https://doi.org/10.1109/ISPASS.2017.7975295","url":null,"abstract":"GPUs are often limited by off-chip memory bandwidth. With the advent of general-purpose computing on GPUs, a cache hierarchy has been introduced to filter the bandwidth demand to the off-chip memory. However, the cache hierarchy presents its own bandwidth limitations in sustaining such high levels of memory traffic. In this paper, we characterize the bandwidth bottlenecks present across the memory hierarchy in GPUs for generalpurpose applications. We quantify the stalls throughout the memory hierarchy and identify the architectural parameters that play a pivotal role in leading to a congested memory system. We explore the architectural design space to mitigate the bandwidth bottlenecks and show that performance improvement achieved by mitigating the bandwidth bottleneck in the cache hierarchy can exceed the speedup obtained by a memory system with a baseline cache hierarchy and High Bandwidth Memory (HBM) DRAM. We also show that addressing the bandwidth bottleneck in isolation at specific levels can be sub-optimal and can even be counter-productive. Therefore, we show that it is imperative to resolve the bandwidth bottlenecks synergistically across different levels of the memory hierarchy. With the insights developed in this paper, we perform a cost-benefit analysis and identify costeffective configurations of the memory hierarchy that effectively mitigate the bandwidth bottlenecks. We show that our final configuration achieves a performance improvement of 29% on average with a minimal area overhead of 1.6%.","PeriodicalId":123307,"journal":{"name":"2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130990151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing OpenCL 2.0 workloads using a heterogeneous CPU-GPU simulator","authors":"Li Wang, Ren-Wei Tsai, Shao-Chung Wang, K. Chen, Po-Han Wang, Hsiang-Yun Cheng, Yi-Chung Lee, Sheng-Jie Shu, Chun-Chieh Yang, Min-Yih Hsu, Li-Chen Kan, Chao-Lin Lee, Tzu-Chieh Yu, Rih-Ding Peng, Chia-Lin Yang, Yuan-Shin Hwang, Jenq-Kuen Lee, Shiao-Li Tsao, M. Ouhyoung","doi":"10.1109/ISPASS.2017.7975279","DOIUrl":"https://doi.org/10.1109/ISPASS.2017.7975279","url":null,"abstract":"Heterogeneous CPU-GPU systems have recently emerged as an energy-efficient computing platform. A robust integrated CPU-GPU simulator is essential to facilitate researches in this direction. While few integrated CPU-GPU simulators are available, similar tools that support OpenCL 2.0, a widely used new standard with promising heterogeneous computing features, are currently missing. In this paper, we extend the existing integrated CPU-GPU simulator, gem5-gpu, to support OpenCL 2.0. In addition, we conduct experiments on the extended simulator to see the impact of new features introduced by OpenCL 2.0. Our OpenCL 2.0 compatible simulator is successfully validated against a state-of-the-art commercial product, and is expected to help boost future studies in heterogeneous CPU-GPU systems.","PeriodicalId":123307,"journal":{"name":"2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125363729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PMAL: Enabling lightweight adaptation of legacy file systems on persistent memory systems","authors":"Hyunsub Song, Y. Moon, Se Kwon Lee, S. Noh","doi":"10.1109/ISPASS.2017.7975268","DOIUrl":"https://doi.org/10.1109/ISPASS.2017.7975268","url":null,"abstract":"The advent of Persistent Memory (PM), which is anticipated to have byte-addressable access latency in par with DRAM and yet nonvolatile, has stepped up interest in using PM as storage. Hence, PM storage targeted file systems are being developed under the premise that legacy file systems are suboptimal on memory bus attached PM-based storage. However, many years of time and effort are ingrained in legacy file systems that are now time-tested and mature. Simply scrapping them altogether may be unwarranted. In this paper, we look into how we can leverage the maturity ingrained in legacy file systems to the fullest, while, at the same time, reaping the high performance offered by PM. To this end, we first go through a thorough analysis of legacy Ext4 file systems, and compare it with NOVA, PMFS, and Ext4 with DAX extension, which are new PM file systems available in Linux. Based on these analyses, we then propose the Persistent Memory Adaptation Layer (PMAL) module that is lightweight (roughly 180 LoC) and can easily be integrated into legacy file systems to take advantage of PM storage. Using Ext4, we show that the performance of PMAL integrated Ext4 is in par with PM file systems for the Filebench and key-value store benchmarks.","PeriodicalId":123307,"journal":{"name":"2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116448140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Docker characterization on high performance SSDs
Qiumin Xu, M. Awasthi, Krishna T. Malladi, J. Bhimani, Jingpei Yang, M. Annavaram
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975282

Abstract: Docker containers [2] are becoming the mainstay for deploying applications in cloud platforms, offering desirable features such as ease of deployment, developer friendliness and lightweight virtualization. Meanwhile, solid state disks (SSDs) have witnessed tremendous performance gains through recent industry innovations such as the Non-Volatile Memory Express (NVMe) standards [3], [4]. However, the performance of containerized applications on these high-speed contemporary SSDs has not yet been investigated. In this paper, we characterize the performance of the wide variety of storage options available for deploying Docker containers and provide the configuration options that best utilize high-performance SSDs.
Performance analysis of CNN frameworks for GPUs
Heehoon Kim, Hyoungwook Nam, Wookeun Jung, Jaejin Lee
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975270

Abstract: Thanks to modern deep learning frameworks that exploit GPUs, convolutional neural networks (CNNs) have been greatly successful in visual recognition tasks. In this paper, we analyze the GPU performance characteristics of five popular deep learning frameworks (Caffe, CNTK, TensorFlow, Theano, and Torch) from the perspective of a representative CNN model, AlexNet. Based on the characteristics obtained, we suggest possible optimization methods to increase the efficiency of CNN models built with these frameworks. We also examine the GPU performance characteristics of different convolution algorithms, each based on one of GEMM, direct convolution, FFT, or the Winograd method, and suggest criteria for choosing convolution algorithms and for building efficient CNN models on GPUs. Since scaling DNNs across multiple GPUs is becoming increasingly important, we also analyze the scalability and overhead of the CNN models built by these frameworks in a multi-GPU context. The results indicate that training the AlexNet model can be sped up by up to 2X just by changing options provided by the frameworks.
Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications
Ugljesa Milic, Alejandro Rico, P. Carpenter, Alex Ramírez
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975265

Abstract: High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMPs) combine a few high-performance big cores for serial regions with many low-power lean cores for throughput computing. The low front-end requirements of HPC applications lead some designs, such as SMT and GPU cores, to share front-end structures including the instruction cache (I-cache). However, little work exists analyzing the benefit of sharing the I-cache among full cores, which is compelling as a way to reduce silicon area and power. This paper analyzes the performance, power and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the design parameters finds the sweet spot: a wide interconnect to access the shared I-cache and a few line buffers that provide the bandwidth and latency required to sustain performance. Projections with McPAT and a rich set of HPC benchmarks show 11% area savings and a 5% energy reduction at no performance cost.
Microarchitecture level reliability comparison of modern GPU designs: First findings
Alessandro Vallero, S. Carlo, Sotiris Tselonis, D. Gizopoulos
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975280

Abstract: State-of-the-art GPU chips are designed to deliver extreme throughput for graphics as well as for data-parallel general-purpose computing workloads (GPGPU computing). Unlike graphics computing, GPGPU computing requires highly reliable operation. The performance-oriented design of GPUs makes it necessary to jointly evaluate the vulnerability of GPU workloads to soft errors and the performance of GPU chips. We briefly present a summary of the findings of an extensive study evaluating the reliability of four GPU architectures and corresponding chips, correlating reliability with the performance of the workloads.
Analyzing the scalability of managed language applications with speedup stacks
Jennifer B. Sartor, Kristof Du Bois, Stijn Eyerman, L. Eeckhout
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975267

Abstract: Understanding why multi-threaded applications do not achieve perfect scaling on modern multicore hardware is challenging. Furthermore, more and more modern programs are written in managed languages, which have extra service threads (e.g., to perform memory management) that may retard scalability and complicate performance analysis. In this paper, we extend speedup stacks, a previously presented visualization tool for analyzing multi-threaded program scalability, to managed applications. Speedup stacks are comprehensive bar graphs that break down an application's execution to explain the main causes of sublinear speedup, i.e., situations where some threads prevent the application from progressing and thus increase execution time. We not only expand speedup stacks to analyze how a managed language's service threads affect overall scalability, but also implement speedup stacks while running on native hardware. We monitor the scheduling behavior of application and service threads using lightweight OS kernel modules, incurring under 1% overhead when running unmodified Java benchmarks. We add two performance delimiters targeting managed applications: garbage collection and main initialization activities. We analyze the scalability limitations of these benchmarks and, using speedup stacks, the impact of both a stop-the-world and a concurrent garbage collector. Our visualization tool facilitates the identification of scalability bottlenecks both between application threads and in service threads, pointing developers to whether optimization should focus on the language runtime or the application. Speedup stacks provide better program understanding for both program and system designers, which can help optimize multicore processor performance.