ACM Transactions on Architecture and Code Optimization: Latest Publications

PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems
CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-09-15 DOI: 10.1145/3624569
Petros Anastasiadis, Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris, Dennis Hoppe, Li Zhong
{"title":"PARALiA : A Performance Aware Runtime for Auto-tuning Linear Algebra on heterogeneous systems","authors":"Petros Anastasiadis, Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris, Dennis Hoppe, Li Zhong","doi":"10.1145/3624569","DOIUrl":"https://doi.org/10.1145/3624569","url":null,"abstract":"Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieve optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded on GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems are accompanied by two significant optimization challenges: data transfer bottlenecks as well as problem splitting and scheduling in multiple workers (GPUs) with distinct memories. We demonstrate that the current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed using current scheduler-based approaches: the determination of which devices should be used for a certain routine invocation. To address these issues we propose a model-based approach: using performance estimation to provide problem-specific autotuning during runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA in an HPC testbed with 8 NVIDIA-V100 GPUs, improving the average performance of GEMM by 1.7X and energy efficiency by 2.5X over the state-of-the-art in a large and diverse dataset and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135397586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-09-08 DOI: 10.1145/3622787
Syed Salauddin Mohammad Tariq, Lance Menard, Pengfei Su, Probir Roy
{"title":"MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications","authors":"Syed Salauddin Mohammad Tariq, Lance Menard, Pengfei Su, Probir Roy","doi":"10.1145/3622787","DOIUrl":"https://doi.org/10.1145/3622787","url":null,"abstract":"The microservice architecture style has gained popularity due to its ability to fault isolation, ease of scaling applications, and developer’s agility. However, writing applications in the microservice design style has its challenges. Due to the loosely coupled nature, services communicate with others through standard communication APIs. This incurs significant overhead in the application due to communication protocol and data transformation. An inefficient service communication at the microservice application logic can further overwhelm the application. We perform a grey literature review showing that unnecessary data transfer is a real challenge in the industry. To the best of our knowledge, no effective tool is currently available to accurately identify the origins of unnecessary microservice communications that lead to significant performance overhead and provide guidance for optimization. To bridge the knowledge gap, we propose MicroProf, a dynamic program analysis tool to detect unnecessary data transfer in Java-based microservice applications. At the implementation level, MicroProf proposes novel techniques such as remote object sampling and hardware debug registers to monitor remote object usage. MicroProf reports the unnecessary data transfer at the application source code level. Furthermore, MicroProf pinpoints the opportunities for communication API optimization. MicroProf is evaluated on four well-known applications involving two real-world applications and two benchmarks, identifying five inefficient remote invocations. Guided by MicroProf, API optimization achieves an 87.5% reduction in the number of fields within REST API responses. The empirical evaluation further reveals that the optimized services experience a speedup of up to 4.59 ×.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"28 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87132560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-09-05 DOI: 10.1145/3617686
Hai Jin, Bo Lei, Haikun Liu, Xiaofei Liao, Zhuohui Duan, Chencheng Ye, Yu Zhang
{"title":"A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures","authors":"Hai Jin, Bo Lei, Haikun Liu, Xiaofei Liao, Zhuohui Duan, Chencheng Ye, Yu Zhang","doi":"10.1145/3617686","DOIUrl":"https://doi.org/10.1145/3617686","url":null,"abstract":"Computing-In-Memory (CIM) architectures using Non-Volatile Memories (NVMs) have emerged as a promising way to address the “memory wall” problem in traditional Von Neumann architectures. CIM accelerators can perform arithmetic or Boolean logic operations in NVMs by fully exploiting their high parallelism for bit-wise operations. These accelerators are often used in cooperation with general-purpose processors to speed up a wide variety of artificial neural network applications. In such a heterogeneous computing architecture, the legacy software should be redesigned and re-engineered to utilize new CIM accelerators. In this paper, we propose a compilation tool to automatically migrate legacy programs to such heterogeneous architectures based on the LLVM compiler infrastructure. To accelerate some computations such as vector-matrix multiplication in CIM accelerators, we identify several typical computing patterns from LLVM intermediate representations (IRs), which are oblivious to high-level programming paradigms. Our compilation tool can modify acceleratable LLVM IRs to offload them to CIM accelerators automatically, without re-engineering legacy software. Experimental results show that our compilation tool can translate many legacy programs to CIM-supported binary executables effectively, and improve application performance and energy efficiency by up to 51 × and 309 ×, respectively, compared with general-purpose x86 processors.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"57 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80969180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Smart-DNN+: A Memory-Efficient Neural Networks Compression Framework for the Model Inference
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-08-30 DOI: 10.1145/3617688
Donglei Wu, Weihao Yang, Xiangyu Zou, Wen Xia, Shiyi Li, Zhenbo Hu, Weizhe Zhang, Binxing Fang
{"title":"Smart-DNN+: a Memory-Efficient Neural Networks Compression Framework for the Model Inference","authors":"Donglei Wu, Weihao Yang, Xiangyu Zou, Wen Xia, Shiyi Li, Zhenbo Hu, Weizhe Zhang, Binxing Fang","doi":"10.1145/3617688","DOIUrl":"https://doi.org/10.1145/3617688","url":null,"abstract":"Deep Neural Networks (DNNs) have achieved remarkable success in various real-world applications. However, running a DNN typically requires hundreds of megabytes of memory footprints, making it challenging to deploy on resource-constrained platforms such as mobile devices and IoT. Although mainstream DNNs compression techniques such as pruning, distillation, and quantization can reduce the memory overhead of model parameters during DNN inference, they suffer from three limitations: (i) low model compression ratio for the lightweight DNN structures with little redundancy; (ii) potential degradation in model inference accuracy; (iii) inadequate memory compression ratio is attributable to ignoring the layering property of DNN inference. To address these issues, we propose a lightweight memory-efficient DNN inference framework called Smart-DNN+, which significantly reduces the memory costs of DNN inference without degrading the model quality. Specifically, ① Smart-DNN+ applies a layer-wise binary-quantizer with a remapping mechanism to greatly reduce the model size by quantizing the typical floating-point DNN weights of 32-bit to the 1-bit signs layer by layer. To maintain model quality, ② Smart-DNN+ employs a bucket-encoder to keep the compressed quantization error by encoding the multiple similar floating-point residuals into the same integer bucket IDs. When running the compressed DNN in the user’s device, ③ Smart-DNN+ utilizes a partially decompressing strategy to greatly reduce the required memory overhead by first loading the compressed DNNs in memory and then dynamically decompressing the required materials for model inference layer by layer. Experimental results on popular DNNs and datasets demonstrate that Smart-DNN+ achieves lower 0.17 (% ) -0.92 (% ) memory costs at lower runtime overheads compared with the state of the arts without degrading the inference accuracy. Moreover, Smart-DNN+ potentially reduces the inference runtime up to 2.04 × that of conventional DNN inference workflow.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"27 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77083358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-08-30 DOI: 10.1145/3617685
Hui Yu, Yu Zhang, Jin Zhao, Yujian Liao, Zhiying Huang, Donghao He, Lin Gu, Hai Jin, Xiaofei Liao, Haikun Liu, Bin He, Jianhui Yue
{"title":"RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network","authors":"Hui Yu, Yu Zhang, Jin Zhao, Yujian Liao, Zhiying Huang, Donghao He, Lin Gu, Hai Jin, Xiaofei Liao, Haikun Liu, Bin He, Jianhui Yue","doi":"10.1145/3617685","DOIUrl":"https://doi.org/10.1145/3617685","url":null,"abstract":"Dynamic Graph Neural Network (DGNN) has recently attracted a significant amount of research attention from various domains, because most real-world graphs are inherently dynamic. Despite of many research effort, for DGNN, existing hardware/software solutions still suffer significantly from redundant computation and memory access overhead, because they need to irregularly access and recompute all graph data of each graph snapshot. To address these issues, we propose an efficient redundancy-aware accelerator, RACE, which enables energy-efficient execution of DGNN models. Specifically, we propose a redundancy-aware incremental execution approach into the accelerator design for DGNN to instantly achieve the output features of the latest graph snapshot by correctly and incrementally refining the output features of the previous graph snapshot and also enable regular accesses of vertices’ input features. Through traversing the graph on the fly, RACE identifies the vertices which are not affected by graph updates between successive snapshots to reuse these vertices’ states (i.e., their output features) of the previous snapshot for the processing of the latest snapshot. The vertices affected by graph updates are also tracked to incrementally recompute their new states using their neighbors’ input features of the latest snapshot for correctness. In this way, the processing and accessing of many graph data which are not affected by graph updates can be correctly eliminated, enabling smaller redundant computation and memory access overhead. Besides, the input features, which are accessed more frequently, are dynamically identified according to graph topology and are preferentially resident in the on-chip memory for less off-chip communications. Experimental results show that RACE achieves on average 1139x and 84.7x speedups for DGNN inference, with average 2242x and 234.2x energy savings, in comparison with the state-of-the-art software DGNN running on Intel Xeon CPU and NVIDIA A100 GPU, respectively. Moreover, for DGNN inference, RACE obtains on average 13.1x, 11.7x, 10.4x, 7.9x speedup and average 14.8x, 12.9x, 11.5x, 8.9x energy saving over the state-of-the-art GNN accelerators, i.e., AWB-GCN, GCNAX, ReGNN, and I-GCN, respectively.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"13 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76021646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-08-26 DOI: 10.1145/3617689
Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan-E Huang, Yutong Lu
{"title":"Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs","authors":"Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan-E Huang, Yutong Lu","doi":"10.1145/3617689","DOIUrl":"https://doi.org/10.1145/3617689","url":null,"abstract":"Transformer models have emerged as a leading approach in the field of natural language processing (NLP) and are increasingly being deployed in production environments. Graphic processing units (GPUs) have become a popular choice for the transformer deployment, and often rely on the batch processing technique to ensure high hardware performance. Nonetheless, the current practice for transformer inference encounters computational and memory redundancy due to the heavy-tailed distribution of sequence lengths in NLP scenarios, resulting in low practical performance. In this paper, we propose a unified solution for improving both computation and memory efficiency of the real-world transformer inference on GPUs. The solution eliminates the redundant computation and memory footprint across a transformer model. At first, a GPU-oriented computation approach is proposed to process the self-attention module in a fine-grained manner, eliminating its redundant computation. Next, the multi-layer perceptron module continues to use the word-accumulation approach to eliminate its redundant computation. Then, to better unify the fine-grained approach and the word-accumulation approach, it organizes the data layout of the self-attention module in block granularity. Since aforementioned approaches make the required memory size largely reduce and constantly fluctuate, we propose the chunk-based approach to enable a better balance between memory footprint and allocation/free efficiency. Our experimental results show that our unified solution achieves a decrease of average latency by 28 (% ) on the entire transformer model, 63.8 (% ) on the self-attention module and reduces memory footprint of intermediate results by 7.8 ×, compared with prevailing frameworks.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"54 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83275360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-07-22 DOI: 10.1145/3605214
Muhammad Waqar Azhar, Madhavan Manivannan, Per Stenström
{"title":"Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints","authors":"Muhammad Waqar Azhar, Madhavan Manivannan, Per Stenström","doi":"https://dl.acm.org/doi/10.1145/3605214","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3605214","url":null,"abstract":"<p>Reducing energy consumption while providing performance and quality guarantees is crucial for computing systems ranging from battery-powered embedded systems to data centers. This article considers approximate iterative applications executing on heterogeneous multi-core platforms under user-specified performance and quality targets. We note that allowing a slight yet bounded relaxation in solution quality can considerably reduce the required iteration count and thereby can save significant amounts of energy. To this end, this article proposes <i>Approx-RM</i>, a resource management scheme that reduces energy expenditure while guaranteeing a specified performance <i>as well as</i> accuracy target. <i>Approx-RM</i> predicts the number of iterations required to meet the relaxed accuracy target at runtime. The time saved generates execution-time slack, which allows <i>Approx-RM</i> to allocate fewer resources on a heterogeneous multi-core platform in terms of DVFS, core type, and core count to save energy while meeting the performance target. <i>Approx-RM</i> contributes with lightweight methods for predicting the iteration count needed to meet the accuracy target and the resources needed to meet the performance target. <i>Approx-RM</i> uses the aforementioned predictions to allocate <i>just enough</i> resources to comply with quality of service constraints to save energy. Our evaluation shows energy savings of 31.6%, on average, compared to <i>Race-to-idle</i> when the accuracy is only relaxed by 1%. <i>Approx-RM</i> incurs timing and energy overheads of less than 0.1%.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"75 4 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-07-22 DOI: 10.1145/3605148
Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuanchi Peng, Cui Wang
{"title":"MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework","authors":"Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuanchi Peng, Cui Wang","doi":"https://dl.acm.org/doi/10.1145/3605148","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3605148","url":null,"abstract":"<p>Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel programs, and data communication is the main performance bottleneck of FFT and seriously affects its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting the “high precision computation, low precision communication” strategy. To enable “low precision communication”, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication, while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve the performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is 1.23 × faster than that of double-precision MFFT on average, and double-precision MFFT achieves performance 3.53× and 9.48× on average higher than open source library 2Decomp&amp;FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. The parallel efficiency of double-precision MFFT increased from 53.2% to 78.1% compared with 2Decomp&amp;FFT, and shared-exponent MFFT further increases the parallel efficiency to 83.8%.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"5 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-07-19 DOI: 10.1145/3600092
Weizhi Xu, Yintai Sun, Shengyu Fan, Hui Yu, Xin Fu
{"title":"Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs","authors":"Weizhi Xu, Yintai Sun, Shengyu Fan, Hui Yu, Xin Fu","doi":"https://dl.acm.org/doi/10.1145/3600092","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3600092","url":null,"abstract":"<p>The convolutional neural network (CNN) is an important deep learning method, which is widely used in many fields. However, it is very time consuming to implement the CNN where convolution usually takes most of the time. There are many zero values in feature maps and filters, which leads to redundant calculations and memory accesses if dense methods are used to compute convolution. Many works recently have made use of sparsity to skip the calculations for zero values to reduce the inference time of the CNN. On the graphics processing unit platform, current works cannot fully exploit the sparsity of the feature map and achieve satisfactory performance. Therefore, we design a new parallel strategy to transform the feature map into a new storage format to avoid the redundant computation of zero values on graphics processing units. Also considering the sparsity in the feature map, we propose a fused storage format to combine the convolution operation with the following pooling operation, to further improve the performance. We carry out experiments with mainstream CNN models and achieve better performance compared with cuDNN and cuSPARSE. For VGG-19, ResNet-50, DenseNet-121, and RegNetX-16GF, 1.97×, 2.23×, 2.74×, and 1.58× speedups respectively are obtained over cuDNN. The speedups over cuSPARSE respectively are 2.10×, 1.83×, 2.35×, and 1.35× when only using the first method.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"875 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138529849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
rNdN: Fast Query Compilation for NVIDIA GPUs
IF 1.6, CAS Zone 3, Computer Science
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-07-19 DOI: 10.1145/3603503
Alexander Krolik, Clark Verbrugge, Laurie Hendren
{"title":"rNdN: Fast Query Compilation for NVIDIA GPUs","authors":"Alexander Krolik, Clark Verbrugge, Laurie Hendren","doi":"https://dl.acm.org/doi/10.1145/3603503","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3603503","url":null,"abstract":"<p>GPU database systems are an effective solution to query optimization, particularly with compilation and data caching. They fall short, however, in end-to-end workloads, as existing compiler toolchains are too expensive for use with short-running queries. In this work, we define and evaluate a runtime-suitable query compilation pipeline for NVIDIA GPUs that extracts high performance with only minimal optimization. In particular, our balanced approach successfully trades minor slowdowns in execution for major speedups in compilation, even as data sizes increase. We demonstrate performance benefits compared to both CPU and GPU database systems using interpreters and compilers, extending query compilation for GPUs beyond cached use cases.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138509542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0