多线程CPU应用程序的SIMT分析器

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) Pub Date : 2022-05-01 DOI:10.1109/ispass55109.2022.00037

Ahmad Alawneh, Mahmoud Khairy, Timothy G. Rogers

{"title":"多线程CPU应用程序的SIMT分析器","authors":"Ahmad Alawneh, Mahmoud Khairy, Timothy G. Rogers","doi":"10.1109/ispass55109.2022.00037","DOIUrl":null,"url":null,"abstract":"The use of GPUs for general purpose applications has drastically increased. However, the performance gain from porting multithreaded CPU workloads to massively parallel SIMT-based accelerators, like GPUs, is often unpredictable. Even with enough parallelism, programmers do not know if their CPU code will run well on a GPU without first investing the effort to refactor it into a GPGPU programming language. Most of this unpredictability stems from two key side-effects of the GPU’s energy-efficient SIMT hardware: control-flow and memory divergence.To alleviate this issue, we propose SIMTec, an analysis tool that computes the control-flow and memory divergence of arbitrary pre-compiled CPU binaries. The tool constructs and analyzes a dynamic control flow graph of the application, batches threads into warps and emulates the operation of a SIMT stack for each warp to compute the projected SIMT efficiency. Given each warp’s execution mask, memory coalescing is computed using the addresses accessed by memory instructions from parallel threads. The tool reports the SIMT efficiency and memory divergence characteristics.We validate SIMTec using a suite of 11 applications with both x86 CPU and CUDA GPU implementations on an NVIDIA Volta V100, demonstrating that SIMTec has a correlation factor of 1.00 and 0.98 for SIMT efficiency and memory divergence, respectively. To demonstrate the predictive power of SIMTec, we explore another 16 CPU workloads for which there is no 1:1 GPU implementation. We perform case studies on these applications that range from compute-intensive thread-parallel workloads to cloud-based request-parallel microservices. Using SIMTec, we demonstrate that many of these CPU-only workloads are amenable to SIMT acceleration as-is.","PeriodicalId":115391,"journal":{"name":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"A SIMT Analyzer for Multi-Threaded CPU Applications\",\"authors\":\"Ahmad Alawneh, Mahmoud Khairy, Timothy G. Rogers\",\"doi\":\"10.1109/ispass55109.2022.00037\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The use of GPUs for general purpose applications has drastically increased. However, the performance gain from porting multithreaded CPU workloads to massively parallel SIMT-based accelerators, like GPUs, is often unpredictable. Even with enough parallelism, programmers do not know if their CPU code will run well on a GPU without first investing the effort to refactor it into a GPGPU programming language. Most of this unpredictability stems from two key side-effects of the GPU’s energy-efficient SIMT hardware: control-flow and memory divergence.To alleviate this issue, we propose SIMTec, an analysis tool that computes the control-flow and memory divergence of arbitrary pre-compiled CPU binaries. The tool constructs and analyzes a dynamic control flow graph of the application, batches threads into warps and emulates the operation of a SIMT stack for each warp to compute the projected SIMT efficiency. Given each warp’s execution mask, memory coalescing is computed using the addresses accessed by memory instructions from parallel threads. The tool reports the SIMT efficiency and memory divergence characteristics.We validate SIMTec using a suite of 11 applications with both x86 CPU and CUDA GPU implementations on an NVIDIA Volta V100, demonstrating that SIMTec has a correlation factor of 1.00 and 0.98 for SIMT efficiency and memory divergence, respectively. To demonstrate the predictive power of SIMTec, we explore another 16 CPU workloads for which there is no 1:1 GPU implementation. We perform case studies on these applications that range from compute-intensive thread-parallel workloads to cloud-based request-parallel microservices. Using SIMTec, we demonstrate that many of these CPU-only workloads are amenable to SIMT acceleration as-is.\",\"PeriodicalId\":115391,\"journal\":{\"name\":\"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ispass55109.2022.00037\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ispass55109.2022.00037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

通用应用程序对gpu的使用急剧增加。然而，将多线程CPU工作负载移植到大规模并行的基于simm的加速器(如gpu)所获得的性能收益通常是不可预测的。即使有足够的并行性，程序员也不知道他们的CPU代码是否能在GPU上运行良好，除非首先投入精力将其重构为GPGPU编程语言。这种不可预测性主要源于GPU节能SIMT硬件的两个关键副作用:控制流和内存发散。为了缓解这个问题，我们提出了SIMTec，一个分析工具，计算任意预编译的CPU二进制文件的控制流和内存散度。该工具构建并分析了应用程序的动态控制流图，将线程分批放入经纱中，并对每个经纱模拟SIMT堆栈的操作，以计算预计的SIMT效率。给定每个warp的执行掩码，使用并行线程的内存指令访问的地址来计算内存合并。该工具报告SIMT效率和内存发散特性。我们在NVIDIA Volta V100上使用一套包含x86 CPU和CUDA GPU实现的11个应用程序来验证SIMTec，证明SIMTec在SIMT效率和内存发散方面分别具有1.00和0.98的相关系数。为了展示SIMTec的预测能力，我们探索了另外16个CPU工作负载，其中没有1:1 GPU实现。我们对这些应用程序进行案例研究，范围从计算密集型线程并行工作负载到基于云的请求并行微服务。使用SIMTec，我们演示了许多这些只使用cpu的工作负载都可以按原样进行SIMT加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A SIMT Analyzer for Multi-Threaded CPU Applications

The use of GPUs for general purpose applications has drastically increased. However, the performance gain from porting multithreaded CPU workloads to massively parallel SIMT-based accelerators, like GPUs, is often unpredictable. Even with enough parallelism, programmers do not know if their CPU code will run well on a GPU without first investing the effort to refactor it into a GPGPU programming language. Most of this unpredictability stems from two key side-effects of the GPU’s energy-efficient SIMT hardware: control-flow and memory divergence.To alleviate this issue, we propose SIMTec, an analysis tool that computes the control-flow and memory divergence of arbitrary pre-compiled CPU binaries. The tool constructs and analyzes a dynamic control flow graph of the application, batches threads into warps and emulates the operation of a SIMT stack for each warp to compute the projected SIMT efficiency. Given each warp’s execution mask, memory coalescing is computed using the addresses accessed by memory instructions from parallel threads. The tool reports the SIMT efficiency and memory divergence characteristics.We validate SIMTec using a suite of 11 applications with both x86 CPU and CUDA GPU implementations on an NVIDIA Volta V100, demonstrating that SIMTec has a correlation factor of 1.00 and 0.98 for SIMT efficiency and memory divergence, respectively. To demonstrate the predictive power of SIMTec, we explore another 16 CPU workloads for which there is no 1:1 GPU implementation. We perform case studies on these applications that range from compute-intensive thread-parallel workloads to cloud-based request-parallel microservices. Using SIMTec, we demonstrate that many of these CPU-only workloads are amenable to SIMT acceleration as-is.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

自引率

0.00%

发文量