2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): Latest Articles, Page 3

GPUCalorie: Floorplan Estimation for GPU Thermal Evaluation

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00034

M. Chow, Ali Jahanshahi, Ana Beltrán, S. Tan, Daniel Wong

Citations: 0

Microarchitectural Performance Evaluation of AV1 Video Encoding Workloads

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00038

Steffen Jensen, Jaekyu Lee, Dam Sunwoo, Matthew Horsnell, L. John

Citations: 1

Left-shifter: A pre-silicon framework for usage model based performance verification of the PCIe interface in server processor system on chips

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00009

Tessil Thomas, B. Venkatasubramanian, Dinesh Sthapit, Christopher Gray, Atresh Gummadavelly, J. Bergeron, Pankaj Mehta, Prabu Thangamuthu

Abstract: Input/Output (IO) peripherals such as storage devices and network interface cards play a significant role in determining the end-user-visible performance of many server applications. In addition, many server applications depend on accelerators to achieve the desired performance levels. PCIe is the de facto standard for connecting IO peripherals and accelerators to server processor Systems on Chip (SoCs). Therefore, it is important to verify that the PCIe interface(s) of a server processor SoC allow full utilization of the available PCIe link bandwidth with reasonable transaction latencies for PCIe traffic patterns corresponding to the most common ways in which PCIe IO devices and accelerators are used by applications. Currently, to the best of our knowledge, such IO and accelerator usage-model-based PCIe interface performance verification can only be done after the manufactured SoC is available (i.e., in post-silicon). Unfortunately, doing such verification in post-silicon means that if any serious performance issues are found, the SoC developer is forced to invest in costly rectification and remanufacturing of the SoC. In this paper, we introduce an emulation-based framework that enables a “shift-left” of usage-model-based PCIe interface performance verification from post-silicon to pre-silicon. In contrast to the current post-silicon-based approach, our framework offers a low-cost, fast-turnaround method to identify and fix PCIe-related performance issues prior to manufacturing the chip.

Citations: 0

Address Translation Conscious Caching and Prefetching for High Performance Cache Hierarchy

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00044

Vasudha, Biswabandan Panda

Abstract: Performance of Translation Lookaside Buffers (TLBs) and on-chip caches plays a crucial role in delivering high performance for memory-intensive applications with irregular memory accesses. Our observations show that, on average, an L2 TLB (STLB) miss for address translation can stall the head of the reorder buffer (ROB) for a maximum of 50 cycles. The corresponding data request, also called the replay load, can stall the head of the ROB for more than 200 cycles. We show that current state-of-the-art mid-level cache (L2C) and last-level cache (LLC) replacement policies do not treat cache blocks holding address translations and replay data accesses differently. As a result, these policies fail to reduce ROB stalls caused by translation and replay data access misses. To improve performance further on top of high-performing cache replacement policies, we propose address translation and replay data access conscious cache replacement policies at L2C and LLC. Our enhancements help in reducing ROB stalls due to STLB misses by 28.76%. We also find that cache blocks storing replay loads are dead (no reuse after insertion), and cache replacement policies alone cannot mitigate the ROB stalls caused by replay data accesses. Hence, we propose an address translation hit triggered hardware prefetcher that brings replay data on an address translation hit at the L2C and LLC. This enhancement reduces ROB stalls due to replay data accesses by 18.5%. For a group of memory-intensive benchmarks with high STLB misses, our enhancements improve performance by 5.1% (reducing ROB stall cycles by 46.7%) and as high as 10.6%, on top of state-of-the-art cache replacement policies that are highly competitive. Our enhancements do not incur any additional storage overhead. However, we need additional flags from the page-table-walker into the cache hierarchy.

Citations: 2

Ruby: Improving Hardware Efficiency for Tensor Algebra Accelerators Through Imperfect Factorization

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00039

Mark Horeni, Pooria Taheri, Po-An Tsai, A. Parashar, J. Emer, S. Joshi

Abstract: Finding high-quality mappings of Deep Neural Network (DNN) models onto tensor accelerators is critical for efficiency. State-of-the-art mapping exploration tools use remainderless (i.e., perfect) factorization to allocate hardware resources, through tiling the tensors, based on factors of tensor dimensions. This limits the size of the search space (i.e., the mapspace) but can lead to low resource utilization. We introduce a new mapspace, Ruby, that adds remainders (i.e., imperfect factorization) to expand the mapspace with high-quality mappings for user-defined architectures. This expansion allows us to allocate resources more precisely by generating tile sizes that better conform to hardware resources. However, this mapspace expansion also incurs an increase in the number of unique mappings. Consequently, this paper studies the trade-off between Ruby’s mapspace expansion and mapping quality. We propose Ruby-S (Spatial) to only employ imperfect factorization towards improved parallelism. Ruby-S incurs a moderate mapspace expansion while reducing energy-delay product (EDP) by up to 50% when implementing ResNet-50 on an Eyeriss-like architecture, with an average improvement of 20%. For the most part, this improvement can be attributed to higher compute utilization. EDP on a Simba-like architecture improves by up to 40%, with an average of 10%. For DeepBench workloads, Ruby-S yields improvements of up to 45%, with an average improvement of 10% on an Eyeriss-like architecture. Ruby-S is robust to accelerator configurations and improves EDP by 20% on average, with a maximum improvement of 55% when implementing ResNet-50 on different accelerator configurations. Ruby-S mappings form a new Pareto frontier, improving the performance of previous configurations by an average of 30% and 20% for ResNet-50 and DeepBench workloads respectively.

Citations: 3
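
To make the contrast between perfect and imperfect factorization in the Ruby abstract above concrete, here is a toy sketch in Python. It is not the Ruby mapspace or any real mapper: it assumes a single tensor dimension spread across a one-dimensional row of processing elements (PEs), and the function names and the sizes (a dimension of 100, 16 PEs) are hypothetical choices made only for illustration.

```python
from math import ceil


def perfect_spatial_factors(dim: int, num_pes: int) -> list[int]:
    """Tile sizes that divide `dim` exactly and fit on `num_pes` PEs."""
    return [f for f in range(1, num_pes + 1) if dim % f == 0]


def utilization(dim: int, tile: int, num_pes: int) -> float:
    """Average fraction of PEs doing useful work over all passes."""
    passes = ceil(dim / tile)  # the last pass may be only partly full
    return dim / (passes * num_pes)


def best_perfect(dim: int, num_pes: int) -> tuple[int, float]:
    """Largest remainderless tile size and the utilization it gives."""
    tile = max(perfect_spatial_factors(dim, num_pes))
    return tile, utilization(dim, tile, num_pes)


def best_imperfect(dim: int, num_pes: int) -> tuple[int, float]:
    """Any tile size up to `num_pes` is allowed; a remainder tile is tolerated."""
    tile = max(
        range(1, num_pes + 1),
        key=lambda t: (utilization(dim, t, num_pes), t),  # ties go to the larger tile
    )
    return tile, utilization(dim, tile, num_pes)


# Hypothetical sizes: 100 does not divide evenly onto 16 PEs.
print("perfect:  ", best_perfect(100, 16))
print("imperfect:", best_imperfect(100, 16))
```

In this toy model the perfect search is limited to the divisors of 100 that fit on 16 PEs, so it settles on a tile of 10 and leaves 6 PEs idle in every pass. The imperfect search can use all 16 PEs for six passes and a remainder tile of 4 in the seventh, which raises average utilization. The cost, as the abstract notes, is a larger set of candidate tile sizes to explore.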

Profiling an Architectural Simulator

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00032

Nedasadat Taheri, Alexander Manely, Ahmni R. Pang, Mohammad Alian

Citations: 1

Cross-Level Characterization of Program Behavior: (Extended Poster Abstract)

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00036

Li Tang, S. Pakin

Citations: 0

Performance Analysis and Optimization with Little’s Law

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00002

Sanyam Mehta

Abstract: Performance tools are the bridge between processor architecture and a user. However, with increasingly complex processor architectures, it is becoming increasingly difficult for users to comprehend the information generated by performance tools in order to diagnose and fix performance bottlenecks. In addition, the performance tools are themselves limited in many cases. Finally, there is wide variability in the kind of performance counters provided by the different processor vendors, making performance tools unportable across emerging architectures. In this work, we propose to solve these problems by accurately computing a portable and easily comprehensible performance metric: the memory-level parallelism (MLP) of an application. The observed MLP, when seen as a fraction of the peak theoretical MLP supported by the host processor, provides important guidance on the applicability of various popular program optimizations. Six case studies on three different processors, each with a different memory technology, show that our metric is both effective in program analysis and provides useful guidance on program optimization.

Citations: 3
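
The abstract above does not spell out how the MLP is computed, so the following Python sketch only shows the textbook form of Little's Law applied to memory traffic: the average number of requests in flight equals the request rate times the average latency. The function names, the 64-byte line size, and all numeric inputs (8 GB/s, 80 ns, a peak of 16 outstanding misses) are assumptions for illustration, not values or methods taken from the paper.

```python
def observed_mlp(bandwidth_bytes_per_s: float,
                 latency_s: float,
                 line_bytes: int = 64) -> float:
    """Little's Law: requests in flight = request rate x time per request."""
    requests_per_s = bandwidth_bytes_per_s / line_bytes
    return requests_per_s * latency_s


def mlp_fraction(observed: float, peak: float) -> float:
    """Observed MLP as a fraction of a (hypothetical) hardware peak."""
    return observed / peak


# Hypothetical measurements for one core.
mlp = observed_mlp(bandwidth_bytes_per_s=8e9, latency_s=80e-9)
print(f"observed MLP     ~ {mlp:.2f} outstanding cache-line requests")
print(f"fraction of peak ~ {mlp_fraction(mlp, peak=16):.2f}")
```

Under these assumed inputs the core sustains roughly 10 outstanding cache-line requests. Read against an assumed peak of 16, that suggests some headroom for optimizations that expose more concurrent misses, which is the kind of guidance the abstract describes.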

The Indigo Program-Verification Microbenchmark Suite of Irregular Parallel Code Patterns

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00003

Yiqiang Liu, Noushin Azami, Corbin Walters, Martin Burtscher

Abstract: Irregular programs are found in many domains and tend to exhibit input-dependent control flow and memory accesses. This paper introduces the Indigo suite of important irregular parallel code patterns for testing verification and other tools. We studied many irregular CPU and GPU programs and extracted the key code patterns. Then, we methodically built variations of these patterns to alter the control-flow and memory-access behavior and/or introduce bugs, yielding the thousands of OpenMP and CUDA microbenchmarks in the suite. Indigo includes a set of generators to systematically create an unbounded number of inputs for each microbenchmark, which is essential to exercise the wide range of possible behaviors of input-dependent codes. To manage the millions of code and input combinations, Indigo provides the flexibility to generate user-defined subsets of the suite. Experiments with a subset of buggy and bug-free codes illustrate that irregular programs pose a significant challenge to both static and dynamic program verification tools. Moreover, such tools can perform quite differently across code patterns that contain the same bug.

Citations: 2

TILE-SIM: A Systematic Approach to Systolic Array-based Accelerator Evaluation

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00016

Yuhang Li, M. Wen, Jiawei Fei, Junzhong Shen, Yasong Cao

Citations: 0