2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): Latest Articles

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00030

Zhongyi Lin, Louis Feng, Ehsan K. Ardestani, Jaewon Lee, John Lundell, Changkyu Kim, Arun Kejariwal, John D. Owens

Abstract: We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), which has low GPU utilization (i.e., the percentage of per-batch training time when kernels are running on the device) compared to other well-optimized vision (CV) and natural language processing (NLP) models. We show that both the device active time (the sum of kernel runtimes) and idle time are important components of the overall device time, and can be tackled separately by (1) flexibly adopting heuristic- and ML-based kernel performance models for kernels that dominate the device active time, and (2) categorizing operator overheads into five types to quantitatively determine their contribution to the overall device time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean absolute error (GMAE) in all kernel performance modeling, and 5.23% and 7.96% geomean errors, respectively, for GPU active time and overall end-to-end per-batch training time prediction on the highly-customized and multi-factor dominated DLRM architectures. We also demonstrate our performance model's ability to generalize to other compute-bound DL models targeted by most previous methods and better assist general model-system co-design than previous work.
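
The abstract's definitions of device active time, idle time, and GPU utilization can be illustrated with a small sketch. This is not the authors' performance model: it only restates the definitions, assuming kernels do not overlap on the device, and the function name and microsecond units are hypothetical.

```python
def device_time_breakdown(kernel_runtimes_us, batch_time_us):
    """Split one training batch into active time, idle time, and utilization.

    Assumes (for illustration only) that kernels never overlap, so the
    device active time is simply the sum of the kernel runtimes.
    """
    if batch_time_us <= 0:
        raise ValueError("batch_time_us must be positive")
    active_us = sum(kernel_runtimes_us)
    if active_us > batch_time_us:
        raise ValueError("kernel runtimes exceed the per-batch time")
    idle_us = batch_time_us - active_us
    # GPU utilization: percentage of per-batch time with kernels running.
    utilization_pct = 100.0 * active_us / batch_time_us
    return active_us, idle_us, utilization_pct


active, idle, util = device_time_breakdown([10.0, 20.0, 30.0], 100.0)
print(active, idle, util)
```

A low utilization value here means idle time dominates, which is the situation the abstract describes for DLRM training.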

Citations: 2

OS-level Implications of Using DRAM Caches in Memory Disaggregation

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00020

B. Gao, Hao-Wei Tee, Alireza Sanaee, Soh Boon Jun, Djordje Jevdjic

Citations: 1

DRAM Bandwidth and Latency Stacks: Visualizing DRAM Bottlenecks

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00045

Stijn Eyerman, W. Heirman, I. Hur

Abstract: For memory-bound applications, memory bandwidth utilization and memory access latency determine performance. DRAM specifications mention the maximum peak bandwidth and uncontended read latency, but this number is never achieved in practice. Many factors impact the actually achieved bandwidth, and it is often not obvious to hardware architects or software developers how higher bandwidth usage, and thus higher performance, can be achieved. Similarly, latency is impacted by numerous technology constraints and queueing in the memory controller. DRAM bandwidth stacks intuitively visualize the memory bandwidth consumption of an application and indicate where potential bandwidth is lost. The top of the stack is the peak bandwidth, while the bottom component shows the actually achieved bandwidth. The other components show how much bandwidth is wasted on DRAM refresh, precharge and activate commands, or because of (parts of) the DRAM chip being idle when there are no memory operations available. DRAM latency stacks show the average latency of a memory read operation, divided into base read time, row conflict, and multiple queue components. DRAM bandwidth and latency stacks are complementary to CPI stacks and speedup stacks, providing additional insight to optimize the performance of an application or to improve the hardware.
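
The idea of a bandwidth stack, with components that add up to the peak bandwidth, can be sketched as follows. This is an illustrative assumption rather than the paper's accounting method: it supposes that every DRAM cycle is attributed to exactly one category, and the category names, function name, and example numbers are all hypothetical.

```python
def bandwidth_stack(peak_gbps, cycles_by_category):
    """Scale per-category cycle counts into bandwidth components.

    Each component is the share of peak bandwidth attributed to that
    category, so the components sum to the peak (the top of the stack).
    """
    total_cycles = sum(cycles_by_category.values())
    if total_cycles == 0:
        raise ValueError("no cycles recorded")
    return {
        name: peak_gbps * cycles / total_cycles
        for name, cycles in cycles_by_category.items()
    }


# Hypothetical cycle counts; "achieved" is the bottom component of the stack.
stack = bandwidth_stack(
    25.6,
    {"achieved": 500, "refresh": 50, "precharge_activate": 250, "idle": 200},
)
for name, gbps in stack.items():
    print(f"{name}: {gbps:.2f} GB/s")
```

Reading the result from "achieved" upward shows which loss category separates the application from the peak.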

Citations: 0

SAPCo Sort: optimizing Degree-Ordering for Power-Law Graphs

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00015

Mohsen Koohi Esfahani, Peter Kilpatrick, H. Vandierendonck

Citations: 1

Flexible Binary Instrumentation Framework to Profile Code Running on Intel GPUs

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00011

Alex Skaletsky, Konstantin Levit-Gurevich, Michael Berezalsky, Yulia Kuznetcova, Hila Yakov

Citations: 1

Spatiotemporal Strategies for Long-Term FPGA Resource Management

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00026

Atefeh Mehrabi, Daniel J. Sorin, Benjamin C. Lee

Citations: 0

FOURST: A code generator for FFT-based fast stencil computations

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00010

Zafar Ahmad, M. Javanmard, Gregory Croisdale, Aaron Gregory, P. Ganapathi, L. Pouchet, R. Chowdhury

Citations: 1

Simulating Noisy Channels in DNA Storage

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00019

Mayank Keoliya, Purusotam Sharma, Djordje Jevdjic

Citations: 1

A SIMT Analyzer for Multi-Threaded CPU Applications

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ispass55109.2022.00037

Ahmad Alawneh, Mahmoud Khairy, Timothy G. Rogers

Abstract: The use of GPUs for general purpose applications has drastically increased. However, the performance gain from porting multithreaded CPU workloads to massively parallel SIMT-based accelerators, like GPUs, is often unpredictable. Even with enough parallelism, programmers do not know if their CPU code will run well on a GPU without first investing the effort to refactor it into a GPGPU programming language. Most of this unpredictability stems from two key side-effects of the GPU's energy-efficient SIMT hardware: control-flow and memory divergence. To alleviate this issue, we propose SIMTec, an analysis tool that computes the control-flow and memory divergence of arbitrary pre-compiled CPU binaries. The tool constructs and analyzes a dynamic control flow graph of the application, batches threads into warps and emulates the operation of a SIMT stack for each warp to compute the projected SIMT efficiency. Given each warp's execution mask, memory coalescing is computed using the addresses accessed by memory instructions from parallel threads. The tool reports the SIMT efficiency and memory divergence characteristics. We validate SIMTec using a suite of 11 applications with both x86 CPU and CUDA GPU implementations on an NVIDIA Volta V100, demonstrating that SIMTec has a correlation factor of 1.00 and 0.98 for SIMT efficiency and memory divergence, respectively. To demonstrate the predictive power of SIMTec, we explore another 16 CPU workloads for which there is no 1:1 GPU implementation. We perform case studies on these applications that range from compute-intensive thread-parallel workloads to cloud-based request-parallel microservices. Using SIMTec, we demonstrate that many of these CPU-only workloads are amenable to SIMT acceleration as-is.
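
The two quantities the abstract mentions, SIMT efficiency from per-warp execution masks and memory coalescing from per-thread addresses, can be sketched with commonly used definitions. This is not SIMTec's implementation: the warp size of 32, the 32-byte segment size, and both function names are assumptions made for illustration.

```python
WARP_SIZE = 32  # assumed warp width


def simt_efficiency(execution_masks, warp_size=WARP_SIZE):
    """Average fraction of active lanes over all executed warp instructions.

    Each mask is an integer whose bit i is 1 when lane i is active.
    """
    if not execution_masks:
        raise ValueError("no warp instructions recorded")
    active_lanes = sum(bin(mask).count("1") for mask in execution_masks)
    return active_lanes / (len(execution_masks) * warp_size)


def memory_transactions(mask, addresses, segment_bytes=32):
    """Count distinct memory segments touched by the active lanes.

    Fewer segments per warp instruction means better coalescing.
    """
    segments = {
        addr // segment_bytes
        for lane, addr in enumerate(addresses)
        if (mask >> lane) & 1
    }
    return len(segments)


full, half = 0xFFFFFFFF, 0x0000FFFF
print(simt_efficiency([full, half]))
print(memory_transactions(full, [4 * i for i in range(32)]))    # unit stride
print(memory_transactions(full, [128 * i for i in range(32)]))  # large stride
```

In this sketch a divergent branch lowers the first metric, while a strided access pattern raises the second.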

Citations: 1

FASE: A Fast, Accurate and Seamless Emulator for Custom Numerical Formats

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00017

John Osorio Ríos, Adrià Armejach, E. Petit, G. Henry, Marc Casas

Citations: 1