2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): Latest Papers

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00030
Zhongyi Lin, Louis Feng, Ehsan K. Ardestani, Jaewon Lee, John Lundell, Changkyu Kim, Arun Kejariwal, John D. Owens
Abstract: We devise a performance model for GPU training of Deep Learning Recommendation Models (DLRM), which has low GPU utilization (i.e., the percentage of per-batch training time when kernels are running on the device) compared to other well-optimized vision (CV) and natural language processing (NLP) models. We show that both the device active time (the sum of kernel runtimes) and idle time are important components of the overall device time, and can be tackled separately by (1) flexibly adopting heuristic- and ML-based kernel performance models for kernels that dominate the device active time, and (2) categorizing operator overheads into five types to quantitatively determine their contribution to the overall device time. Combining these two parts, we propose a critical-path-based algorithm to predict the per-batch training time of DLRM by traversing its execution graph. We achieve less than 10% geometric mean absolute error (GMAE) in all kernel performance modeling, and 5.23% and 7.96% geomean errors, respectively, for GPU active time and overall end-to-end per-batch training time prediction on the highly-customized and multi-factor dominated DLRM architectures. We also demonstrate our performance model's ability to generalize to other compute-bound DL models targeted by most previous methods and better assist general model-system co-design than previous work.
Citations: 2
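The critical-path idea in the abstract above can be illustrated with a generic longest-path computation over an operator graph. This is only a sketch under stated assumptions, not the paper's algorithm: the function name `predict_batch_time`, the operator names, and the single lumped overhead value per operator are all hypothetical (the paper distinguishes five overhead types and models kernels separately).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def predict_batch_time(ops, deps):
    """Return the length of the longest (critical) path through an operator DAG.

    ops:  {op_name: (kernel_time_us, overhead_us)}   (hypothetical inputs)
    deps: {op_name: [names of operators that must finish first]}
    """
    finish = {}
    # TopologicalSorter expects a mapping from node to its predecessors.
    graph = {name: deps.get(name, []) for name in ops}
    for name in TopologicalSorter(graph).static_order():
        kernel_us, overhead_us = ops[name]
        # An operator starts once all of its predecessors have finished.
        start = max((finish[p] for p in deps.get(name, [])), default=0.0)
        finish[name] = start + overhead_us + kernel_us
    return max(finish.values(), default=0.0)


if __name__ == "__main__":
    # Made-up numbers for illustration only; not measurements from the paper.
    ops = {
        "embedding": (120, 15),
        "bottom_mlp": (80, 10),
        "interaction": (40, 5),
        "top_mlp": (60, 10),
    }
    deps = {"interaction": ["embedding", "bottom_mlp"], "top_mlp": ["interaction"]}
    print(predict_batch_time(ops, deps))
```

In this toy graph the embedding branch, not the bottom MLP, determines when the interaction operator can start, which is the kind of dependency a plain sum of kernel runtimes would miss.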
OS-level Implications of Using DRAM Caches in Memory Disaggregation
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00020
B. Gao, Hao-Wei Tee, Alireza Sanaee, Soh Boon Jun, Djordje Jevdjic
Abstract: Memory disaggregation has attracted great attention recently due to its benefits in resource utilization efficiency, isolation of failures, and easier reconfiguration of memory hardware. However, applications running on a system with disaggregated memory are expected to suffer from performance degradation due to increased remote memory access latency and network contention. The performance gap is meant to be bridged using DRAM caches on the processor side, which would filter out most of the network traffic. This work examines the overheads of the disaggregated memory abstraction. By experimenting with both micro-benchmarks and production applications, we observe severe degradation in memory access latency and potential bottlenecks within the OS kernel. These bottlenecks could potentially be avoided through low-level optimizations in memory management tailored for memory disaggregation.
Citations: 1
DRAM Bandwidth and Latency Stacks: Visualizing DRAM Bottlenecks
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ispass55109.2022.00045
Stijn Eyerman, W. Heirman, I. Hur
Abstract: For memory-bound applications, memory bandwidth utilization and memory access latency determine performance. DRAM specifications mention the maximum peak bandwidth and uncontended read latency, but this number is never achieved in practice. Many factors impact the actually achieved bandwidth, and it is often not obvious to hardware architects or software developers how higher bandwidth usage, and thus higher performance, can be achieved. Similarly, latency is impacted by numerous technology constraints and queueing in the memory controller. DRAM bandwidth stacks intuitively visualize the memory bandwidth consumption of an application and indicate where potential bandwidth is lost. The top of the stack is the peak bandwidth, while the bottom component shows the actually achieved bandwidth. The other components show how much bandwidth is wasted on DRAM refresh, precharge and activate commands, or because of (parts of) the DRAM chip being idle when there are no memory operations available. DRAM latency stacks show the average latency of a memory read operation, divided into base read time, row conflict, and multiple queue components. DRAM bandwidth and latency stacks are complementary to CPI stacks and speedup stacks, providing additional insight to optimize the performance of an application or to improve the hardware.
Citations: 0
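A bandwidth stack as described in the abstract above splits the peak bandwidth into an achieved component plus loss components. The sketch below assumes a simple accounting in which every DRAM cycle is attributed to exactly one category, so the components sum to the peak. The function name `bandwidth_stack`, the per-cycle attribution, and the example numbers are assumptions for illustration; the paper's actual accounting rules are not given in the abstract.

```python
def bandwidth_stack(peak_gbps, cycles):
    """Split peak bandwidth across categories in proportion to attributed cycles.

    cycles: {category: number of cycles attributed to that category}
    Returns {category: bandwidth share in GB/s}; the shares sum to peak_gbps.
    """
    total = sum(cycles.values())
    if total <= 0:
        raise ValueError("at least one cycle must be attributed")
    return {name: peak_gbps * count / total for name, count in cycles.items()}


if __name__ == "__main__":
    # Hypothetical cycle counts; the categories follow the abstract's wording.
    stack = bandwidth_stack(
        25.6,
        {"achieved": 500, "refresh": 50, "precharge_activate": 150, "idle": 300},
    )
    for name, gbps in stack.items():
        print(f"{name:>20}: {gbps:.2f} GB/s")
```

Reading the result from the bottom up gives the achieved bandwidth first and the peak at the top, matching the stack layout the abstract describes.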
SAPCo Sort: optimizing Degree-Ordering for Power-Law Graphs
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00015
Mohsen Koohi Esfahani, Peter Kilpatrick, H. Vandierendonck
Abstract: We introduce the Structure-Aware Parallel Counting (SAPCo) Sort algorithm that optimizes performance of degree-ordering, a key operation in graph analytics. SAPCo leverages the skewed degree distribution to accelerate sorting. The evaluation for graphs of up to 3.6 billion vertices shows that SAPCo sort is, on average, 1.7-33.5 times faster than state-of-the-art sorting algorithms such as counting sort, radix sort, and sample sort.
Citations: 1
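For context on the baselines named in the abstract above, the following is a plain sequential counting sort that orders vertices by degree. It is not SAPCo itself: it is neither parallel nor structure-aware, and the choice of descending order and the function name `degree_order` are assumptions made for this sketch.

```python
def degree_order(degrees):
    """Return vertex IDs ordered by descending degree using a stable counting sort.

    degrees: list where degrees[v] is the (non-negative integer) degree of vertex v.
    """
    if not degrees:
        return []
    max_deg = max(degrees)

    # Histogram: how many vertices have each degree.
    counts = [0] * (max_deg + 1)
    for d in degrees:
        counts[d] += 1

    # Starting output position for each degree, highest degree first.
    offsets = [0] * (max_deg + 1)
    pos = 0
    for d in range(max_deg, -1, -1):
        offsets[d] = pos
        pos += counts[d]

    # Scatter vertices to their slots; ties keep ascending vertex-ID order.
    order = [0] * len(degrees)
    for v, d in enumerate(degrees):
        order[offsets[d]] = v
        offsets[d] += 1
    return order


if __name__ == "__main__":
    print(degree_order([1, 5, 1, 2, 1, 9]))  # -> [5, 1, 3, 0, 2, 4]
```

The histogram has one bucket per possible degree, so on a power-law graph most buckets at the high end hold few or no vertices while a handful of low-degree buckets hold almost all of them. That skew is what the abstract says SAPCo exploits.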
Flexible Binary Instrumentation Framework to Profile Code Running on Intel GPUs
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ispass55109.2022.00011
Alex Skaletsky, Konstantin Levit-Gurevich, Michael Berezalsky, Yulia Kuznetcova, Hila Yakov
Abstract: Functional and performance profiling of workloads is critical in developing software and hardware. Binary Instrumentation Technology has played a key role in this task for many years in the world of x86 architecture. However, such capabilities have not been available until recently for graphics devices, especially in the Intel Graphics Processing Unit world. The GTPin framework is the only tool that supports profiling graphics and GP-GPU kernels running on extremely parallel Intel GPU devices. GTPin supports a wide range of capabilities for software and hardware developers. With GTPin, you can profile real-world graphics and compute applications at a level of performance close to real hardware. Such an ability is critical in accelerating hardware and software readiness.
Citations: 1
Spatiotemporal Strategies for Long-Term FPGA Resource Management
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ispass55109.2022.00026
Atefeh Mehrabi, Daniel J. Sorin, Benjamin C. Lee
Abstract: The deployment of increasingly large and capable FPGAs has motivated mechanisms for sharing them, but system support for FPGAs is not yet mature. Traditional scheduling algorithms do not account for the unique characteristics of FPGAs, leading to infeasible or inefficient allocations. We propose a novel scheduling policy, called Spatiotemporal FPGA Scheduling, that overcomes these challenges to achieve long-term target allocations by tracking and correcting deviations from targets across management time periods. Compared to traditional algorithms, Spatiotemporal FPGA Scheduling produces allocations that are up to 32% closer to targets, improves average throughput by up to 44%, and improves average FPGA utilization by up to 23%.
Citations: 0
FOURST: A code generator for FFT-based fast stencil computations
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ispass55109.2022.00010
Zafar Ahmad, M. Javanmard, Gregory Croisdale, Aaron Gregory, P. Ganapathi, L. Pouchet, R. Chowdhury
Abstract: Stencil computations are ubiquitous in modern grid-based physical simulations. In this paper, we present FOURST, a compiler to generate programs computing time-iterated linear periodic and aperiodic stencil computations with fast Fourier transform methods. This paper outlines the design and implementation of the code generation approach in FOURST, to automatically generate FFT-based stencil solvers. We present experimental results on the state-of-the-art Ookami supercomputer housing Fujitsu A64FX and Intel Skylake processors, to study the performance of FOURST and a state-of-the-art tiling-based optimized code generator PLuTo on various stencil shapes and varying the number of time iterations. We discuss the performance profiles, and limitations, of both approaches on high-end modern hardware.
Citations: 1
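The reason Fourier methods apply to the linear periodic stencils mentioned in the abstract above is that one stencil step is a circular convolution, so T steps amount to multiplying each frequency component by the stencil's symbol raised to the power T. The sketch below checks this on a 1D three-point stencil. It is an illustration under assumptions, not FOURST's generated code: the function names are hypothetical, and a naive O(n^2) DFT stands in for a real FFT to keep the example dependency-free.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (a real implementation would use an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def stencil_direct(u, weights, steps):
    """Apply a linear stencil `steps` times with periodic boundaries.

    weights: {offset: coefficient}, e.g. {-1: 0.25, 0: 0.5, 1: 0.25}
    """
    n = len(u)
    for _ in range(steps):
        u = [sum(w * u[(i + off) % n] for off, w in weights.items()) for i in range(n)]
    return u


def stencil_fourier(u, weights, steps):
    """Same result as stencil_direct, computed with one forward and one inverse transform."""
    n = len(u)
    # Symbol of the stencil: its effect on the frequency-k Fourier mode.
    symbol = [
        sum(w * cmath.exp(2j * cmath.pi * k * off / n) for off, w in weights.items())
        for k in range(n)
    ]
    spectrum = dft(u)
    advanced = [s * lam ** steps for s, lam in zip(spectrum, symbol)]
    return [v.real for v in dft(advanced, inverse=True)]


if __name__ == "__main__":
    u0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    w = {-1: 0.25, 0: 0.5, 1: 0.25}
    a = stencil_direct(u0, w, 5)
    b = stencil_fourier(u0, w, 5)
    print(max(abs(x - y) for x, y in zip(a, b)))  # tiny floating-point difference
```

The direct version costs work proportional to the number of time steps, while the Fourier version does a fixed number of transforms regardless of the step count.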
Simulating Noisy Channels in DNA Storage
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ispass55109.2022.00019
Mayank Keoliya, Purusotam Sharma, Djordje Jevdjic
Abstract: Compared to conventional storage media, DNA-based data storage offers benefits such as durability, high density and low energy consumption. With increased demand for DNA data storage, it has become important to quickly evaluate proposed approaches. However, experiments that involve reading and writing synthetic DNA are costly and time-consuming, thus requiring cheap and fast simulation prior to experimentation. DNA sequencing technologies such as Nanopore and Illumina have highly characteristic error profiles, and simulating them is challenging. We propose a DNA simulator for Nanopore data that improves on existing simulators by incorporating key parameters; our simulator better converges to error profiles of real data on most parameters. We show that the spatial distribution of errors within a strand is a key determinant of trace reconstruction accuracy, a factor that had not been considered by existing simulators.
Citations: 1
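To make the notion of a noisy channel in the abstract above concrete, here is a minimal position-independent channel that applies insertions, deletions, and substitutions with fixed probabilities at every base. It is a deliberately naive baseline and not the authors' simulator: the function name `noisy_read` and the uniform error rates are assumptions, and the uniform treatment of positions is exactly the simplification the abstract argues is inadequate, since it ignores the spatial distribution of errors within a strand.

```python
import random

BASES = "ACGT"


def noisy_read(strand, p_sub, p_ins, p_del, rng):
    """Return one noisy copy of `strand`.

    p_sub, p_ins, p_del: per-base substitution, insertion, and deletion
    probabilities (hypothetical parameters; p_sub + p_del must be <= 1).
    rng: a random.Random instance, so that runs are reproducible.
    """
    out = []
    for base in strand:
        # Possibly insert a random base before the current one.
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:
            continue  # the base is dropped
        if r < p_del + p_sub:
            # Substitute with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)
    original = "ACGTACGTACGTACGT"
    for _ in range(3):
        print(noisy_read(original, p_sub=0.05, p_ins=0.03, p_del=0.05, rng=rng))
```

A simulator that models spatial structure would make these probabilities depend on the position within the strand instead of keeping them constant.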
A SIMT Analyzer for Multi-Threaded CPU Applications
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ispass55109.2022.00037
Ahmad Alawneh, Mahmoud Khairy, Timothy G. Rogers
Abstract: The use of GPUs for general purpose applications has drastically increased. However, the performance gain from porting multithreaded CPU workloads to massively parallel SIMT-based accelerators, like GPUs, is often unpredictable. Even with enough parallelism, programmers do not know if their CPU code will run well on a GPU without first investing the effort to refactor it into a GPGPU programming language. Most of this unpredictability stems from two key side-effects of the GPU's energy-efficient SIMT hardware: control-flow and memory divergence. To alleviate this issue, we propose SIMTec, an analysis tool that computes the control-flow and memory divergence of arbitrary pre-compiled CPU binaries. The tool constructs and analyzes a dynamic control flow graph of the application, batches threads into warps and emulates the operation of a SIMT stack for each warp to compute the projected SIMT efficiency. Given each warp's execution mask, memory coalescing is computed using the addresses accessed by memory instructions from parallel threads. The tool reports the SIMT efficiency and memory divergence characteristics. We validate SIMTec using a suite of 11 applications with both x86 CPU and CUDA GPU implementations on an NVIDIA Volta V100, demonstrating that SIMTec has a correlation factor of 1.00 and 0.98 for SIMT efficiency and memory divergence, respectively. To demonstrate the predictive power of SIMTec, we explore another 16 CPU workloads for which there is no 1:1 GPU implementation. We perform case studies on these applications that range from compute-intensive thread-parallel workloads to cloud-based request-parallel microservices. Using SIMTec, we demonstrate that many of these CPU-only workloads are amenable to SIMT acceleration as-is.
Citations: 1
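The two metrics named in the abstract above can be sketched from per-instruction execution masks and per-thread addresses. The definitions used here are common ones and are assumptions as far as this paper is concerned: SIMT efficiency is taken as the fraction of warp lanes active, averaged over executed instructions, and memory divergence is taken as the number of distinct memory segments one warp instruction touches. The function names, the warp size of 32, and the 128-byte segment size are likewise illustrative choices.

```python
def simt_efficiency(active_masks, warp_size=32):
    """Fraction of lanes doing useful work, averaged over executed warp instructions.

    active_masks: one integer bitmask per executed warp instruction,
    with bit i set when lane i is active.
    """
    if not active_masks:
        return 0.0
    active_lanes = sum(bin(mask).count("1") for mask in active_masks)
    return active_lanes / (warp_size * len(active_masks))


def memory_transactions(addresses, segment_bytes=128):
    """Number of distinct segments touched by the active lanes of one memory instruction.

    addresses: byte addresses accessed by the active lanes only.
    One transaction means the access is fully coalesced.
    """
    return len({addr // segment_bytes for addr in addresses})


if __name__ == "__main__":
    # One instruction with all 32 lanes active, one with half of them masked off.
    print(simt_efficiency([0xFFFFFFFF, 0x0000FFFF]))       # 0.75
    # Consecutive 4-byte accesses fall into a single 128-byte segment.
    print(memory_transactions([4 * i for i in range(32)]))   # 1
    # A 128-byte stride puts every lane in its own segment.
    print(memory_transactions([128 * i for i in range(32)]))  # 32
```

A CPU workload whose threads mostly follow the same branches and touch neighbouring addresses would score well on both metrics, which is the kind of prediction the tool is described as making before any porting effort.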
FASE: A Fast, Accurate and Seamless Emulator for Custom Numerical Formats
2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date: 2022-05-01 DOI: 10.1109/ISPASS55109.2022.00017
John Osorio Ríos, Adrià Armejach, E. Petit, G. Henry, Marc Casas
Abstract: Deep Neural Networks (DNNs) have become ubiquitous in a wide range of application domains. Despite their success, training DNNs is an expensive task that has motivated the use of reduced numerical precision formats to improve performance and reduce power consumption. Emulation techniques are a good fit to understand the properties of new numerical formats on a particular workload. However, current state-of-the-art (SoA) techniques are not able to perform these tasks quickly and accurately on a wide variety of workloads. We propose FASE, a Fast, Accurate, and Seamless Emulator that leverages dynamic binary translation to enable emulation of custom numerical formats. FASE is fast: allowing emulation of large unmodified workloads; accurate: emulating at the instruction operand level; and seamless: as it does not require any code modifications and works on any application or DNN framework without any language, compiler, or source code access restrictions.
Citations: 1
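Emulating a reduced-precision format at the operand level, as the abstract above describes, comes down to rounding each operand to the target format before it is used. The sketch below does this for one example format, bfloat16, by rounding a float32 bit pattern to its upper 16 bits with round-to-nearest-even. The choice of bfloat16 and the function name `round_to_bfloat16` are assumptions; the abstract does not name specific formats, and FASE applies its emulation through dynamic binary translation, which this sketch does not attempt.

```python
import struct


def round_to_bfloat16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    The result is returned as a Python float. Finite inputs outside the
    float32 range raise OverflowError from struct.pack.
    """
    if x != x:  # NaN: pass through unchanged
        return x
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that will be discarded.
    lsb = (bits >> 16) & 1
    bits = (bits + 0x7FFF + lsb) & 0xFFFFFFFF
    # bfloat16 keeps only the upper 16 bits of the float32 pattern.
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def emulated_multiply(a, b):
    """Multiply with both operands and the result rounded to bfloat16."""
    return round_to_bfloat16(round_to_bfloat16(a) * round_to_bfloat16(b))


if __name__ == "__main__":
    print(round_to_bfloat16(3.141592653589793))  # 3.140625
    print(emulated_multiply(1.1, 1.1))
```

With only 7 explicit mantissa bits, bfloat16 keeps two to three significant decimal digits, which is why pi rounds to 3.140625 here.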