ACM Transactions on Architecture and Code Optimization: Latest Publications

PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems
CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-09-15 DOI: 10.1145/3624569
Petros Anastasiadis, Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris, Dennis Hoppe, Li Zhong
{"title":"PARALiA : A Performance Aware Runtime for Auto-tuning Linear Algebra on heterogeneous systems","authors":"Petros Anastasiadis, Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris, Dennis Hoppe, Li Zhong","doi":"10.1145/3624569","DOIUrl":"https://doi.org/10.1145/3624569","url":null,"abstract":"Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieve optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded on GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems are accompanied by two significant optimization challenges: data transfer bottlenecks as well as problem splitting and scheduling in multiple workers (GPUs) with distinct memories. We demonstrate that the current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed using current scheduler-based approaches: the determination of which devices should be used for a certain routine invocation. To address these issues we propose a model-based approach: using performance estimation to provide problem-specific autotuning during runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA in an HPC testbed with 8 NVIDIA-V100 GPUs, improving the average performance of GEMM by 1.7X and energy efficiency by 2.5X over the state-of-the-art in a large and diverse dataset and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135397586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-09-08 DOI: 10.1145/3622787
Syed Salauddin Mohammad Tariq, Lance Menard, Pengfei Su, Probir Roy
{"title":"MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications","authors":"Syed Salauddin Mohammad Tariq, Lance Menard, Pengfei Su, Probir Roy","doi":"10.1145/3622787","DOIUrl":"https://doi.org/10.1145/3622787","url":null,"abstract":"The microservice architecture style has gained popularity due to its ability to fault isolation, ease of scaling applications, and developer’s agility. However, writing applications in the microservice design style has its challenges. Due to the loosely coupled nature, services communicate with others through standard communication APIs. This incurs significant overhead in the application due to communication protocol and data transformation. An inefficient service communication at the microservice application logic can further overwhelm the application. We perform a grey literature review showing that unnecessary data transfer is a real challenge in the industry. To the best of our knowledge, no effective tool is currently available to accurately identify the origins of unnecessary microservice communications that lead to significant performance overhead and provide guidance for optimization. To bridge the knowledge gap, we propose MicroProf, a dynamic program analysis tool to detect unnecessary data transfer in Java-based microservice applications. At the implementation level, MicroProf proposes novel techniques such as remote object sampling and hardware debug registers to monitor remote object usage. MicroProf reports the unnecessary data transfer at the application source code level. Furthermore, MicroProf pinpoints the opportunities for communication API optimization. MicroProf is evaluated on four well-known applications involving two real-world applications and two benchmarks, identifying five inefficient remote invocations. Guided by MicroProf, API optimization achieves an 87.5% reduction in the number of fields within REST API responses. The empirical evaluation further reveals that the optimized services experience a speedup of up to 4.59 ×.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"28 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87132560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-09-05 DOI: 10.1145/3617686
Hai Jin, Bo Lei, Haikun Liu, Xiaofei Liao, Zhuohui Duan, Chencheng Ye, Yu Zhang
{"title":"A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures","authors":"Hai Jin, Bo Lei, Haikun Liu, Xiaofei Liao, Zhuohui Duan, Chencheng Ye, Yu Zhang","doi":"10.1145/3617686","DOIUrl":"https://doi.org/10.1145/3617686","url":null,"abstract":"Computing-In-Memory (CIM) architectures using Non-Volatile Memories (NVMs) have emerged as a promising way to address the “memory wall” problem in traditional Von Neumann architectures. CIM accelerators can perform arithmetic or Boolean logic operations in NVMs by fully exploiting their high parallelism for bit-wise operations. These accelerators are often used in cooperation with general-purpose processors to speed up a wide variety of artificial neural network applications. In such a heterogeneous computing architecture, the legacy software should be redesigned and re-engineered to utilize new CIM accelerators. In this paper, we propose a compilation tool to automatically migrate legacy programs to such heterogeneous architectures based on the LLVM compiler infrastructure. To accelerate some computations such as vector-matrix multiplication in CIM accelerators, we identify several typical computing patterns from LLVM intermediate representations (IRs), which are oblivious to high-level programming paradigms. Our compilation tool can modify acceleratable LLVM IRs to offload them to CIM accelerators automatically, without re-engineering legacy software. Experimental results show that our compilation tool can translate many legacy programs to CIM-supported binary executables effectively, and improve application performance and energy efficiency by up to 51 × and 309 ×, respectively, compared with general-purpose x86 processors.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"57 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80969180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Smart-DNN+: A Memory-Efficient Neural Networks Compression Framework for the Model Inference
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-08-30 DOI: 10.1145/3617688
Donglei Wu, Weihao Yang, Xiangyu Zou, Wen Xia, Shiyi Li, Zhenbo Hu, Weizhe Zhang, Binxing Fang
{"title":"Smart-DNN+: a Memory-Efficient Neural Networks Compression Framework for the Model Inference","authors":"Donglei Wu, Weihao Yang, Xiangyu Zou, Wen Xia, Shiyi Li, Zhenbo Hu, Weizhe Zhang, Binxing Fang","doi":"10.1145/3617688","DOIUrl":"https://doi.org/10.1145/3617688","url":null,"abstract":"Deep Neural Networks (DNNs) have achieved remarkable success in various real-world applications. However, running a DNN typically requires hundreds of megabytes of memory footprints, making it challenging to deploy on resource-constrained platforms such as mobile devices and IoT. Although mainstream DNNs compression techniques such as pruning, distillation, and quantization can reduce the memory overhead of model parameters during DNN inference, they suffer from three limitations: (i) low model compression ratio for the lightweight DNN structures with little redundancy; (ii) potential degradation in model inference accuracy; (iii) inadequate memory compression ratio is attributable to ignoring the layering property of DNN inference. To address these issues, we propose a lightweight memory-efficient DNN inference framework called Smart-DNN+, which significantly reduces the memory costs of DNN inference without degrading the model quality. Specifically, ① Smart-DNN+ applies a layer-wise binary-quantizer with a remapping mechanism to greatly reduce the model size by quantizing the typical floating-point DNN weights of 32-bit to the 1-bit signs layer by layer. To maintain model quality, ② Smart-DNN+ employs a bucket-encoder to keep the compressed quantization error by encoding the multiple similar floating-point residuals into the same integer bucket IDs. When running the compressed DNN in the user’s device, ③ Smart-DNN+ utilizes a partially decompressing strategy to greatly reduce the required memory overhead by first loading the compressed DNNs in memory and then dynamically decompressing the required materials for model inference layer by layer. Experimental results on popular DNNs and datasets demonstrate that Smart-DNN+ achieves lower 0.17 (% ) -0.92 (% ) memory costs at lower runtime overheads compared with the state of the arts without degrading the inference accuracy. Moreover, Smart-DNN+ potentially reduces the inference runtime up to 2.04 × that of conventional DNN inference workflow.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"27 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77083358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-08-30 DOI: 10.1145/3617685
Hui Yu, Yu Zhang, Jin Zhao, Yujian Liao, Zhiying Huang, Donghao He, Lin Gu, Hai Jin, Xiaofei Liao, Haikun Liu, Bin He, Jianhui Yue
{"title":"RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network","authors":"Hui Yu, Yu Zhang, Jin Zhao, Yujian Liao, Zhiying Huang, Donghao He, Lin Gu, Hai Jin, Xiaofei Liao, Haikun Liu, Bin He, Jianhui Yue","doi":"10.1145/3617685","DOIUrl":"https://doi.org/10.1145/3617685","url":null,"abstract":"Dynamic Graph Neural Network (DGNN) has recently attracted a significant amount of research attention from various domains, because most real-world graphs are inherently dynamic. Despite of many research effort, for DGNN, existing hardware/software solutions still suffer significantly from redundant computation and memory access overhead, because they need to irregularly access and recompute all graph data of each graph snapshot. To address these issues, we propose an efficient redundancy-aware accelerator, RACE, which enables energy-efficient execution of DGNN models. Specifically, we propose a redundancy-aware incremental execution approach into the accelerator design for DGNN to instantly achieve the output features of the latest graph snapshot by correctly and incrementally refining the output features of the previous graph snapshot and also enable regular accesses of vertices’ input features. Through traversing the graph on the fly, RACE identifies the vertices which are not affected by graph updates between successive snapshots to reuse these vertices’ states (i.e., their output features) of the previous snapshot for the processing of the latest snapshot. The vertices affected by graph updates are also tracked to incrementally recompute their new states using their neighbors’ input features of the latest snapshot for correctness. In this way, the processing and accessing of many graph data which are not affected by graph updates can be correctly eliminated, enabling smaller redundant computation and memory access overhead. Besides, the input features, which are accessed more frequently, are dynamically identified according to graph topology and are preferentially resident in the on-chip memory for less off-chip communications. Experimental results show that RACE achieves on average 1139x and 84.7x speedups for DGNN inference, with average 2242x and 234.2x energy savings, in comparison with the state-of-the-art software DGNN running on Intel Xeon CPU and NVIDIA A100 GPU, respectively. Moreover, for DGNN inference, RACE obtains on average 13.1x, 11.7x, 10.4x, 7.9x speedup and average 14.8x, 12.9x, 11.5x, 8.9x energy saving over the state-of-the-art GNN accelerators, i.e., AWB-GCN, GCNAX, ReGNN, and I-GCN, respectively.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"13 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76021646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-08-26 DOI: 10.1145/3617689
Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan-E Huang, Yutong Lu
{"title":"Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs","authors":"Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan-E Huang, Yutong Lu","doi":"10.1145/3617689","DOIUrl":"https://doi.org/10.1145/3617689","url":null,"abstract":"Transformer models have emerged as a leading approach in the field of natural language processing (NLP) and are increasingly being deployed in production environments. Graphic processing units (GPUs) have become a popular choice for the transformer deployment, and often rely on the batch processing technique to ensure high hardware performance. Nonetheless, the current practice for transformer inference encounters computational and memory redundancy due to the heavy-tailed distribution of sequence lengths in NLP scenarios, resulting in low practical performance. In this paper, we propose a unified solution for improving both computation and memory efficiency of the real-world transformer inference on GPUs. The solution eliminates the redundant computation and memory footprint across a transformer model. At first, a GPU-oriented computation approach is proposed to process the self-attention module in a fine-grained manner, eliminating its redundant computation. Next, the multi-layer perceptron module continues to use the word-accumulation approach to eliminate its redundant computation. Then, to better unify the fine-grained approach and the word-accumulation approach, it organizes the data layout of the self-attention module in block granularity. Since aforementioned approaches make the required memory size largely reduce and constantly fluctuate, we propose the chunk-based approach to enable a better balance between memory footprint and allocation/free efficiency. Our experimental results show that our unified solution achieves a decrease of average latency by 28 (% ) on the entire transformer model, 63.8 (% ) on the self-attention module and reduces memory footprint of intermediate results by 7.8 ×, compared with prevailing frameworks.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"54 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83275360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-07-22 DOI: 10.1145/3605214
Muhammad Waqar Azhar, Madhavan Manivannan, Per Stenström
{"title":"Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints","authors":"Muhammad Waqar Azhar, Madhavan Manivannan, Per Stenström","doi":"https://dl.acm.org/doi/10.1145/3605214","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3605214","url":null,"abstract":"<p>Reducing energy consumption while providing performance and quality guarantees is crucial for computing systems ranging from battery-powered embedded systems to data centers. This article considers approximate iterative applications executing on heterogeneous multi-core platforms under user-specified performance and quality targets. We note that allowing a slight yet bounded relaxation in solution quality can considerably reduce the required iteration count and thereby can save significant amounts of energy. To this end, this article proposes <i>Approx-RM</i>, a resource management scheme that reduces energy expenditure while guaranteeing a specified performance <i>as well as</i> accuracy target. <i>Approx-RM</i> predicts the number of iterations required to meet the relaxed accuracy target at runtime. The time saved generates execution-time slack, which allows <i>Approx-RM</i> to allocate fewer resources on a heterogeneous multi-core platform in terms of DVFS, core type, and core count to save energy while meeting the performance target. <i>Approx-RM</i> contributes with lightweight methods for predicting the iteration count needed to meet the accuracy target and the resources needed to meet the performance target. <i>Approx-RM</i> uses the aforementioned predictions to allocate <i>just enough</i> resources to comply with quality of service constraints to save energy. Our evaluation shows energy savings of 31.6%, on average, compared to <i>Race-to-idle</i> when the accuracy is only relaxed by 1%. <i>Approx-RM</i> incurs timing and energy overheads of less than 0.1%.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"75 4 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-07-22 DOI: 10.1145/3605148
Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuanchi Peng, Cui Wang
{"title":"MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework","authors":"Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuanchi Peng, Cui Wang","doi":"https://dl.acm.org/doi/10.1145/3605148","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3605148","url":null,"abstract":"<p>Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel programs, and data communication is the main performance bottleneck of FFT and seriously affects its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting the “high precision computation, low precision communication” strategy. To enable “low precision communication”, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication, while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve the performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is 1.23 × faster than that of double-precision MFFT on average, and double-precision MFFT achieves performance 3.53× and 9.48× on average higher than open source library 2Decomp&amp;FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. The parallel efficiency of double-precision MFFT increased from 53.2% to 78.1% compared with 2Decomp&amp;FFT, and shared-exponent MFFT further increases the parallel efficiency to 83.8%.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"5 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-07-19 DOI: 10.1145/3600092
Weizhi Xu, Yintai Sun, Shengyu Fan, Hui Yu, Xin Fu
{"title":"Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs","authors":"Weizhi Xu, Yintai Sun, Shengyu Fan, Hui Yu, Xin Fu","doi":"https://dl.acm.org/doi/10.1145/3600092","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3600092","url":null,"abstract":"<p>The convolutional neural network (CNN) is an important deep learning method, which is widely used in many fields. However, it is very time consuming to implement the CNN where convolution usually takes most of the time. There are many zero values in feature maps and filters, which leads to redundant calculations and memory accesses if dense methods are used to compute convolution. Many works recently have made use of sparsity to skip the calculations for zero values to reduce the inference time of the CNN. On the graphics processing unit platform, current works cannot fully exploit the sparsity of the feature map and achieve satisfactory performance. Therefore, we design a new parallel strategy to transform the feature map into a new storage format to avoid the redundant computation of zero values on graphics processing units. Also considering the sparsity in the feature map, we propose a fused storage format to combine the convolution operation with the following pooling operation, to further improve the performance. We carry out experiments with mainstream CNN models and achieve better performance compared with cuDNN and cuSPARSE. For VGG-19, ResNet-50, DenseNet-121, and RegNetX-16GF, 1.97×, 2.23×, 2.74×, and 1.58× speedups respectively are obtained over cuDNN. The speedups over cuSPARSE respectively are 2.10×, 1.83×, 2.35×, and 1.35× when only using the first method.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"875 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
rNdN: Fast Query Compilation for NVIDIA GPUs
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-07-19 DOI: 10.1145/3603503
Alexander Krolik, Clark Verbrugge, Laurie Hendren
{"title":"rNdN: Fast Query Compilation for NVIDIA GPUs","authors":"Alexander Krolik, Clark Verbrugge, Laurie Hendren","doi":"https://dl.acm.org/doi/10.1145/3603503","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3603503","url":null,"abstract":"<p>GPU database systems are an effective solution to query optimization, particularly with compilation and data caching. They fall short, however, in end-to-end workloads, as existing compiler toolchains are too expensive for use with short-running queries. In this work, we define and evaluate a runtime-suitable query compilation pipeline for NVIDIA GPUs that extracts high performance with only minimal optimization. In particular, our balanced approach successfully trades minor slowdowns in execution for major speedups in compilation, even as data sizes increase. We demonstrate performance benefits compared to both CPU and GPU database systems using interpreters and compilers, extending query compilation for GPUs beyond cached use cases.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0