ACM Transactions on Architecture and Code Optimization: Latest Articles

At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-10-25 DOI: 10.1145/3629520
Jens Domke, Emil Vatai, Balazs Gerofi, Yuetsu Kodama, Mohamed Wahib, Artur Podobas, Sparsh Mittal, Miquel Pericàs, Lingqi Zhang, Peng Chen, Aleksandr Drozd, Satoshi Matsuoka
{"title":"At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads","authors":"Jens Domke, Emil Vatai, Balazs Gerofi, Yuetsu Kodama, Mohamed Wahib, Artur Podobas, Sparsh Mittal, Miquel Pericàs, Lingqi Zhang, Peng Chen, Aleksandr Drozd, Satoshi Matsuoka","doi":"10.1145/3629520","DOIUrl":"https://doi.org/10.1145/3629520","url":null,"abstract":"Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56x for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"65 sp1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135219250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-10-25 DOI: 10.1145/3629523
Ziaul Choudhury, Anish Gulati, Suresh Purini
{"title":"FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler","authors":"Ziaul Choudhury, Anish Gulati, Suresh Purini","doi":"10.1145/3629523","DOIUrl":"https://doi.org/10.1145/3629523","url":null,"abstract":"The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSA) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving towards becoming an alternative to custom ASICs for designing DSAs because of their low power consumption and a higher degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources towards achieving optimal throughput. Hardware Descriptive Languages (HDL) like Verilog have been traditionally used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register transfer level. Domain Specific Languages (DSLs) and compilers have been recently used to weave together handwritten HDLs templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. Overlays are DSAs resembling a processor, which is synthesized and flashed on the FPGA once but is flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image processing overlays. Image processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms. This paper proposes a DSL-based overlay accelerator called FlowPix for image processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kande optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and Vitis Vision library that generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in the FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption that we experience. 
FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920x1080 pixels in the implemented benchmarks.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"42 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
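To make the stage-based pipeline abstraction concrete, here is a small, hypothetical pipeline written in Python. The Pipeline class and the box_blur and threshold stages are invented for illustration; they are not FlowPix's actual DSL, and a real FlowPix program would execute on the FPGA overlay rather than on the CPU as this toy does.

```python
# Hedged illustration of a stage-based image-processing pipeline (hypothetical
# names, not FlowPix's syntax). Each stage is a function applied in order.
import numpy as np

class Pipeline:
    def __init__(self):
        self.stages = []
    def stage(self, fn):
        self.stages.append(fn)
        return self
    def run(self, image):
        for fn in self.stages:
            image = fn(image)
        return image

def box_blur(img):
    # 3x3 box filter via shifted sums (edge replication at the borders).
    padded = np.pad(img, 1, mode="edge").astype(np.float32)
    acc = np.zeros_like(img, dtype=np.float32)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            acc += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return acc / 9.0

def threshold(img, t=128):
    return (img > t).astype(np.uint8) * 255

frame = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
out = Pipeline().stage(box_blur).stage(lambda x: threshold(x, 100)).run(frame)
print(out.shape, out.dtype)  # (1080, 1920) uint8
```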
Citations: 0
Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-10-25 DOI: 10.1145/3630108
Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei
{"title":"Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training","authors":"Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei","doi":"10.1145/3630108","DOIUrl":"https://doi.org/10.1145/3630108","url":null,"abstract":"In recent years, benefiting from the increase in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models using accelerators such as GPUs often requires much iterative data to be transferred from NVMe SSD to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPU Direct Storage (GDS) to accelerate it. However, tensor data during training (such as Checkpoints, logs, and intermediate feature maps) which is also time-consuming, is often transferred using traditional serial, long-I/O-path transfer methods. In this paper, based on GDS technology, we built Fastensor, an efficient tool for tensor data transfer between NVMe SSDs and GPUs. To achieve higher tensor data I/O throughput, we optimized the traditional data I/O process. We also proposed a data and runtime context-aware tensor I/O algorithm. Fastensor can select the most suitable data transfer tool for the current tensor from a candidate set of tools during model training. The optimal tool is derived from a dictionary generated by our adaptive exploration algorithm in the first few training iterations. We used Fastensor’s unified interface to test the read/write bandwidth and energy consumption of different transfer tools for different sizes of tensor blocks. We found that the execution efficiency of different tensor transfer tools is related to both the tensor block size and the runtime context. We then deployed Fastensor in the widely applicable Pytorch deep learning framework. We showed that Fastensor could perform superior in typical scenarios of model parameter saving and intermediate feature map transfer with the same hardware configuration. Fastensor achieves a 5.37x read performance improvement compared to torch . save () when used for model parameter saving. When used for intermediate feature map transfer, Fastensor can increase the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I/O API.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-10-25 DOI: 10.1145/3629522
Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araújo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. França, Mauricio Breternitz Jr., Lizy K. John
{"title":"ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks","authors":"Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araújo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. França, Mauricio Breternitz Jr., Lizy K. John","doi":"10.1145/3629522","DOIUrl":"https://doi.org/10.1145/3629522","url":null,"abstract":"”Extreme edge“ devices such as smart sensors are a uniquely challenging environment for the deployment of machine learning. The tiny energy budgets of these devices lie beyond what is feasible for conventional deep neural networks, particularly in high-throughput scenarios, requiring us to rethink how we approach edge inference. In this work, we propose ULEEN, a model and FPGA-based accelerator architecture based on weightless neural networks (WNNs). WNNs eliminate energy-intensive arithmetic operations, instead using table lookups to perform computation, which makes them theoretically well-suited for edge inference. However, WNNs have historically suffered from poor accuracy and excessive memory usage. ULEEN incorporates algorithmic improvements and a novel training strategy inspired by binary neural networks (BNNs) to make significant strides in addressing these issues. We compare ULEEN against BNNs in software and hardware using the four MLPerf Tiny datasets and MNIST. Our FPGA implementations of ULEEN accomplish classification at 4.0-14.3 million inferences per second, improving area-normalized throughput by an average of 3.6 × and steady-state energy efficiency by an average of 7.1 × compared to the FPGA-based Xilinx FINN BNN inference platform. While ULEEN is not a universally applicable machine learning model, we demonstrate that it can be an excellent choice for certain applications in energy- and latency-critical edge environments.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134973617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-10-21 DOI: 10.1145/3629525
Tong-yu Liu, Jianmei Guo, Bo Huang
{"title":"Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping","authors":"Tong-yu Liu, Jianmei Guo, Bo Huang","doi":"10.1145/3629525","DOIUrl":"https://doi.org/10.1145/3629525","url":null,"abstract":"Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. There are many events to be monitored in a modern processor while only a few hardware performance monitoring counters (PMCs) can be used, so multiplexing is commonly adopted. However, inefficiency commonly exists in state-of-the-art profiling tools when grouping events for multiplexing PMCs. It has the risk of inaccurate measurement and misleading analysis. Commercial tools can leverage PMCs but they are closed-source and only support their specified platforms. To this end, we propose an approach for efficient cross-platform microarchitecture performance measurement via adaptive grouping, aiming to improve the metrics’ sampling ratios. The approach generates event groups based on the number of available PMCs detected on arbitrary machines while avoiding the scheduling pitfall of Linux perf_event subsystem. We evaluate our approach with SPEC CPU 2017 on four mainstream x86-64 and AArch64 processors and conduct comparative analyses of efficiency with two other state-of-the-art tools, LIKWID and ARM Top-down Tool. The experimental results indicate that our approach gains around 50% improvement in the average sampling ratio of metrics without compromising the correctness and reliability.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"75 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135511089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-10-20 DOI: 10.1145/3629524
Satya Jaswanth Badri, Mukesh Saini, Neeraj Goel
{"title":"Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing","authors":"Satya Jaswanth Badri, Mukesh Saini, Neeraj Goel","doi":"10.1145/3629524","DOIUrl":"https://doi.org/10.1145/3629524","url":null,"abstract":"Battery-less technology evolved to replace battery usage in space, deep mines, and other environments to reduce cost and pollution. Non-volatile memory (NVM) based processors were explored for saving the system state during a power failure. Such devices have a small SRAM and large non-volatile memory. To make the system energy efficient, we need to use SRAM efficiently. So we must select some portions of the application and map them to either SRAM or FRAM. This paper proposes an ILP-based memory mapping technique for intermittently powered IoT devices. Our proposed technique gives an optimal mapping choice that reduces the system’s Energy-Delay Product (EDP). We validated our system using TI-based MSP430FR6989 and MSP430F5529 development boards. Our proposed memory configuration consumes 38.10% less EDP than the baseline configuration and 9.30% less EDP than the existing work under stable power. Our proposed configuration achieves 20.15% less EDP than the baseline configuration and 26.87% less EDP than the existing work under unstable power. This work supports intermittent computing and works efficiently during frequent power failures.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135618053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Characterizing Multi-Chip GPU Data Sharing
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-10-20 DOI: 10.1145/3629521
Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, Lieven Eeckhout
{"title":"Characterizing Multi-Chip GPU Data Sharing","authors":"Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, Lieven Eeckhout","doi":"10.1145/3629521","DOIUrl":"https://doi.org/10.1145/3629521","url":null,"abstract":"Multi-chip GPU systems are critical to scale performance beyond a single GPU chip for a wide variety of important emerging applications. A key challenge for multi-chip GPUs though is how to overcome the bandwidth gap between inter-chip and intra-chip communication. Accesses to shared data, i.e., data accessed by multiple chips, pose a major performance challenge as they incur remote memory accesses possibly congesting the inter-chip links and degrading overall system performance. This paper characterizes the shared data set in multi-chip GPUs in terms of (1) truly versus falsely shared data, (2) how the shared data set scales with input size, (3) along which dimensions the shared data set scales, and (4) how sensitive the shared data set is with respect to the input’s characteristics, i.e., node degree and connectivity in graph workloads. We observe significant variety in scaling behavior across workloads: some workloads feature a shared data set that scales linearly with input size, while others feature sublinear scaling (following a (sqrt {2} ) or (sqrt [3]{2} ) relationship). We further demonstrate how the shared data set affects the optimum last-level cache organization (memory-side versus SM-side) in multi-chip GPUs, as well as optimum memory page allocation and thread scheduling policy. Sensitivity analyses demonstrate the insights across the broad design space.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135618434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DxPU: Large Scale Disaggregated GPU Pools in the Datacenter
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-10-05 DOI: 10.1145/3617995
Bowen He, Xiao Zheng, Yuan Chen, Weinan Li, Yajin Zhou, Xin Long, Pengcheng Zhang, Xiaowei Lu, Linquan Jiang, Qiang Liu, Dennis Cai, Xiantao Zhang
{"title":"DxPU: Large Scale Disaggregated GPU Pools in the Datacenter","authors":"Bowen He, Xiao Zheng, Yuan Chen, Weinan Li, Yajin Zhou, Xin Long, Pengcheng Zhang, Xiaowei Lu, Linquan Jiang, Qiang Liu, Dennis Cai, Xiantao Zhang","doi":"10.1145/3617995","DOIUrl":"https://doi.org/10.1145/3617995","url":null,"abstract":"The rapid adoption of AI and convenience offered by cloud services have resulted in the growing demands for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, the fixed assembly combination of host servers and GPUs is extremely inefficient in resource utilization, upgrade, and maintenance. Due to these issues, the GPU disaggregation technique has been proposed to decouple GPUs from host servers. It aggregates GPUs into a pool, and allocates GPU node(s) according to user demands. However, existing GPU disaggregation systems have flaws in software-hardware compatibility, disaggregation scope, and capacity. In this paper, we present a new implementation of datacenter-scale GPU disaggregation, named DxPU. DxPU efficiently solves the above problems and can flexibly allocate as many GPU node(s) as users demand. In order to understand the performance overhead incurred by DxPU, we build up a performance model for AI specific workloads. With the guidance of modeling results, we develop a prototype system, which has been deployed into the datacenter of a leading cloud provider for a test run. We also conduct detailed experiments to evaluate the performance overhead caused by our system. The results show that the overhead of DxPU is less than 10%, compared with native GPU servers, in most of user scenarios.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135481968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-09-21 DOI: 10.1145/3625005
Shiyi Li, Qiang Cao, Shenggang Wan, Wen Xia, Changsheng Xie
{"title":"gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes","authors":"Shiyi Li, Qiang Cao, Shenggang Wan, Wen Xia, Changsheng Xie","doi":"10.1145/3625005","DOIUrl":"https://doi.org/10.1145/3625005","url":null,"abstract":"Erasure codes are widely deployed in modern storage systems, leading to frequent usage of their encoding/decoding operations. The encoding/decoding process for erasure codes is generally carried out using the parity-check matrix approach. However, this approach is serial and computationally expensive, mainly due to dealing with matrix operations, which results in low encoding/decoding performance. These drawbacks are particularly evident for newer erasure codes, including SD and LRC codes. To address these limitations, this paper introduces the Partitioned and Parallel Matrix ( PPM ) algorithm. This algorithm partitions the parity-check matrix, parallelizes encoding/decoding operations, and optimizes calculation sequence to facilitate fast encoding/decoding of these codes. Furthermore, we present a generalized PPM ( gPPM ) algorithm that surpasses PPM in performance by employing fine-grained dynamic matrix calculation sequence selection. Unlike PPM, gPPM is also applicable to erasure codes such as RS code. Experimental results demonstrate that PPM improves the encoding/decoding speed of SD and LRC codes by up to (210.81% ) . Besides, gPPM achieves up to (102.41% ) improvement over PPM and (32.25% ) improvement over RS regarding encoding/decoding speed.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136235362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions
Computer Science (CAS Zone 3)
ACM Transactions on Architecture and Code Optimization Pub Date: 2023-09-20 DOI: 10.1145/3625004
Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo
{"title":"Advancing Direct Convolution using Convolution Slicing Optimization and ISA Extensions","authors":"Victor Ferrari, Rafael Sousa, Marcio Pereira, João P. L. de Carvalho, José Nelson Amaral, José Moreira, Guido Araujo","doi":"10.1145/3625004","DOIUrl":"https://doi.org/10.1145/3625004","url":null,"abstract":"Convolution is one of the most computationally intensive operations that must be performed for machine-learning model inference. A traditional approach to computing convolutions is known as the Im2Col + BLAS method. This paper proposes SConv: a direct-convolution algorithm based on an MLIR/LLVM code-generation toolchain that can be integrated into machine-learning compilers. This algorithm introduces: (a) Convolution Slicing Analysis (CSA) — a convolution-specific 3D cache-blocking analysis pass that focuses on tile reuse over the cache hierarchy; (b) Convolution Slicing Optimization (CSO) — a code-generation pass that uses CSA to generate a tiled direct-convolution macro-kernel; and (c) Vector-Based Packing (VBP) — an architecture-specific optimized input-tensor packing solution based on vector-register shift instructions for convolutions with unitary stride. Experiments conducted on 393 convolutions from full ONNX-MLIR machine-learning models indicate that the elimination of the Im2Col transformation and the use of fast packing routines result in a total packing time reduction, on full model inference, of 2.3x – 4.0x on Intel x86 and 3.3x – 5.9x on IBM POWER10. The speed-up over an Im2Col + BLAS method based on current BLAS implementations for end-to-end machine-learning model inference is in the range of 11% – 27% for Intel x86 and 11% – 34% for IBM POWER10 architectures. The total convolution speedup for model inference is 13% – 28% on Intel x86 and 23% – 39% on IBM POWER10. SConv also outperforms BLAS GEMM, when computing pointwise convolutions in more than 82% of the 219 tested instances.","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136313810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1