SimBench: A portable benchmarking methodology for full-system simulators
Harry Wagstaff, Bruno Bodin, T. Spink, Björn Franke
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975293

Abstract: Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications, simulation performance is paramount. In this paper we argue that existing benchmark suites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited to identifying performance bottlenecks in full-system simulators. While their large, complex workloads indicate how a simulator performs on ‘real-world’ workloads, they give no indication of why a particular simulator runs an application faster or slower than another. We present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a full-system simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memory access performance, I/O and other performance-sensitive areas. SimBench is a cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, targeting the ARM and Intel x86 architectures, we demonstrate that SimBench accurately pinpoints and explains real-world performance anomalies that are largely obfuscated by existing application-oriented benchmarks.
HW/SW co-designed processors: Challenges, design choices and a simulation infrastructure for evaluation
Rakesh Kumar, José Cano, Aleksandar Brankovic, Demos Pavlou, Kyriakos Stavrou, E. Gibert, Alejandro Martínez, Antonio González
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975290

Abstract: Improving single-thread performance is a key challenge in modern microprocessors, especially because the traditional approach of increasing clock frequency and pipeline depth cannot be pushed further due to power constraints. Researchers have therefore been looking at unconventional architectures to boost single-thread performance without running into the power wall. HW/SW co-designed processors, like Nvidia Denver, are emerging as a promising alternative. However, HW/SW co-designed processors need to address key challenges such as startup delay, providing high performance with simple hardware, and translation/optimization overhead before they can become mainstream. A fundamental requirement for evaluating the design choices and trade-offs involved in meeting these challenges is a simulation infrastructure; unfortunately, no such infrastructure is available today. Building one itself poses significant challenges, as it encompasses the complexities of not only an architectural framework but also a compilation framework. This paper identifies the key challenges that HW/SW co-designed processors face and the basic requirements for a simulation infrastructure targeting these architectures. Furthermore, the paper presents DARCO, a simulation infrastructure that enables research in this domain.
{"title":"Evaluating and mitigating bandwidth bottlenecks across the memory hierarchy in GPUs","authors":"Saumay Dublish, V. Nagarajan, N. Topham","doi":"10.1109/ISPASS.2017.7975295","DOIUrl":"https://doi.org/10.1109/ISPASS.2017.7975295","url":null,"abstract":"GPUs are often limited by off-chip memory bandwidth. With the advent of general-purpose computing on GPUs, a cache hierarchy has been introduced to filter the bandwidth demand to the off-chip memory. However, the cache hierarchy presents its own bandwidth limitations in sustaining such high levels of memory traffic. In this paper, we characterize the bandwidth bottlenecks present across the memory hierarchy in GPUs for generalpurpose applications. We quantify the stalls throughout the memory hierarchy and identify the architectural parameters that play a pivotal role in leading to a congested memory system. We explore the architectural design space to mitigate the bandwidth bottlenecks and show that performance improvement achieved by mitigating the bandwidth bottleneck in the cache hierarchy can exceed the speedup obtained by a memory system with a baseline cache hierarchy and High Bandwidth Memory (HBM) DRAM. We also show that addressing the bandwidth bottleneck in isolation at specific levels can be sub-optimal and can even be counter-productive. Therefore, we show that it is imperative to resolve the bandwidth bottlenecks synergistically across different levels of the memory hierarchy. With the insights developed in this paper, we perform a cost-benefit analysis and identify costeffective configurations of the memory hierarchy that effectively mitigate the bandwidth bottlenecks. We show that our final configuration achieves a performance improvement of 29% on average with a minimal area overhead of 1.6%.","PeriodicalId":123307,"journal":{"name":"2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130990151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing OpenCL 2.0 workloads using a heterogeneous CPU-GPU simulator","authors":"Li Wang, Ren-Wei Tsai, Shao-Chung Wang, K. Chen, Po-Han Wang, Hsiang-Yun Cheng, Yi-Chung Lee, Sheng-Jie Shu, Chun-Chieh Yang, Min-Yih Hsu, Li-Chen Kan, Chao-Lin Lee, Tzu-Chieh Yu, Rih-Ding Peng, Chia-Lin Yang, Yuan-Shin Hwang, Jenq-Kuen Lee, Shiao-Li Tsao, M. Ouhyoung","doi":"10.1109/ISPASS.2017.7975279","DOIUrl":"https://doi.org/10.1109/ISPASS.2017.7975279","url":null,"abstract":"Heterogeneous CPU-GPU systems have recently emerged as an energy-efficient computing platform. A robust integrated CPU-GPU simulator is essential to facilitate researches in this direction. While few integrated CPU-GPU simulators are available, similar tools that support OpenCL 2.0, a widely used new standard with promising heterogeneous computing features, are currently missing. In this paper, we extend the existing integrated CPU-GPU simulator, gem5-gpu, to support OpenCL 2.0. In addition, we conduct experiments on the extended simulator to see the impact of new features introduced by OpenCL 2.0. Our OpenCL 2.0 compatible simulator is successfully validated against a state-of-the-art commercial product, and is expected to help boost future studies in heterogeneous CPU-GPU systems.","PeriodicalId":123307,"journal":{"name":"2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125363729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PMAL: Enabling lightweight adaptation of legacy file systems on persistent memory systems","authors":"Hyunsub Song, Y. Moon, Se Kwon Lee, S. Noh","doi":"10.1109/ISPASS.2017.7975268","DOIUrl":"https://doi.org/10.1109/ISPASS.2017.7975268","url":null,"abstract":"The advent of Persistent Memory (PM), which is anticipated to have byte-addressable access latency in par with DRAM and yet nonvolatile, has stepped up interest in using PM as storage. Hence, PM storage targeted file systems are being developed under the premise that legacy file systems are suboptimal on memory bus attached PM-based storage. However, many years of time and effort are ingrained in legacy file systems that are now time-tested and mature. Simply scrapping them altogether may be unwarranted. In this paper, we look into how we can leverage the maturity ingrained in legacy file systems to the fullest, while, at the same time, reaping the high performance offered by PM. To this end, we first go through a thorough analysis of legacy Ext4 file systems, and compare it with NOVA, PMFS, and Ext4 with DAX extension, which are new PM file systems available in Linux. Based on these analyses, we then propose the Persistent Memory Adaptation Layer (PMAL) module that is lightweight (roughly 180 LoC) and can easily be integrated into legacy file systems to take advantage of PM storage. Using Ext4, we show that the performance of PMAL integrated Ext4 is in par with PM file systems for the Filebench and key-value store benchmarks.","PeriodicalId":123307,"journal":{"name":"2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116448140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Docker characterization on high performance SSDs
Qiumin Xu, M. Awasthi, Krishna T. Malladi, J. Bhimani, Jingpei Yang, M. Annavaram
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975282

Abstract: Docker containers [2] are becoming the mainstay for deploying applications in cloud platforms, offering desirable features such as ease of deployment, developer friendliness and lightweight virtualization. Meanwhile, solid state disks (SSDs) have witnessed tremendous performance gains through recent industry innovations such as the Non-Volatile Memory Express (NVMe) standards [3], [4]. However, the performance of containerized applications on these high-speed contemporary SSDs has not yet been investigated. In this paper, we characterize the performance of the wide variety of storage options available for deploying Docker containers and provide the configuration options that best utilize high-performance SSDs.
Performance analysis of CNN frameworks for GPUs
Heehoon Kim, Hyoungwook Nam, Wookeun Jung, Jaejin Lee
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975270

Abstract: Thanks to modern deep learning frameworks that exploit GPUs, convolutional neural networks (CNNs) have been greatly successful in visual recognition tasks. In this paper, we analyze the GPU performance characteristics of five popular deep learning frameworks (Caffe, CNTK, TensorFlow, Theano, and Torch) from the perspective of a representative CNN model, AlexNet. Based on the characteristics obtained, we suggest possible optimization methods to increase the efficiency of CNN models built with these frameworks. We also examine the GPU performance characteristics of different convolution algorithms, each based on one of GEMM, direct convolution, FFT, or the Winograd method, and suggest criteria for choosing convolution algorithms and for building efficient CNN models on GPUs. Since scaling DNNs across multiple GPUs is becoming increasingly important, we also analyze the scalability and overhead of the CNN models built by these frameworks in a multi-GPU context. The results indicate that training the AlexNet model can be sped up by up to 2X just by changing options provided by the frameworks.
Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications
Ugljesa Milic, Alejandro Rico, P. Carpenter, Alex Ramírez
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975265

Abstract: High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMPs) combine a few high-performance big cores for serial regions with many low-power lean cores for throughput computing. The low front-end requirements of HPC applications lead some designs, such as SMT and GPU cores, to share front-end structures including the instruction cache (I-cache). However, little work exists analyzing the benefit of sharing the I-cache among full cores, which is compelling as a way to reduce silicon area and power. This paper analyzes the performance, power and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the design parameters finds the sweet spot: a wide interconnect to access the shared I-cache and a few line buffers that provide the bandwidth and latency required to sustain performance. Projections with McPAT and a rich set of HPC benchmarks show 11% area savings and a 5% energy reduction at no performance cost.
Microarchitecture level reliability comparison of modern GPU designs: First findings
Alessandro Vallero, S. Carlo, Sotiris Tselonis, D. Gizopoulos
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975280

Abstract: State-of-the-art GPU chips are designed to deliver extreme throughput for graphics as well as for data-parallel general-purpose computing workloads (GPGPU computing). Unlike graphics computing, GPGPU computing requires highly reliable operation. The performance-oriented design of GPUs makes it necessary to jointly evaluate the vulnerability of GPU workloads to soft errors and the performance of GPU chips. We briefly present a summary of the findings of an extensive study evaluating the reliability of four GPU architectures and corresponding chips, correlating reliability with the performance of the workloads.
Analyzing the scalability of managed language applications with speedup stacks
Jennifer B. Sartor, Kristof Du Bois, Stijn Eyerman, L. Eeckhout
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). DOI: 10.1109/ISPASS.2017.7975267

Abstract: Understanding why multi-threaded applications do not achieve perfect scaling on modern multicore hardware is challenging. Furthermore, more and more modern programs are written in managed languages, which have extra service threads (e.g., to perform memory management) that may retard scalability and complicate performance analysis. In this paper, we extend speedup stacks, a previously presented visualization tool for analyzing multi-threaded program scalability, to managed applications. Speedup stacks are comprehensive bar graphs that break down an application's execution to explain the main causes of sublinear speedup, i.e., situations where some threads prevent the application from progressing and thus increase execution time. We not only expand speedup stacks to analyze how a managed language's service threads affect overall scalability, but also implement speedup stacks while running on native hardware. We monitor the scheduling behavior of application and service threads using lightweight OS kernel modules, incurring under 1% overhead when running unmodified Java benchmarks. We add two performance delimiters targeting managed applications: garbage collection and main initialization activities. We analyze the scalability limitations of these benchmarks and, using speedup stacks, the impact of both a stop-the-world and a concurrent garbage collector. Our visualization tool facilitates the identification of scalability bottlenecks both between application threads and in service threads, pointing developers to whether optimization should focus on the language runtime or the application. Speedup stacks provide better program understanding for both program and system designers, which can help optimize multicore processor performance.