Zhengyan Liu, Qiang Liu, Shun Yan, Ray C.C. Cheung
{"title":"An Efficient FPGA-based Depthwise Separable Convolutional Neural Network Accelerator with Hardware Pruning","authors":"Zhengyan Liu, Qiang Liu, Shun Yan, Ray C.C. Cheung","doi":"10.1145/3615661","DOIUrl":"https://doi.org/10.1145/3615661","url":null,"abstract":"Convolutional neural networks (CNNs) have been widely deployed in computer vision tasks. However, the computation and resource intensive characteristics of CNN bring obstacles to its application on embedded systems. This paper proposes an efficient inference accelerator on FPGA for CNNs with depthwise separable convolutions (DSCs). To improve the accelerator efficiency, we make four contributions: (1) an efficient convolution engine with multiple strategies for exploiting parallelism and a configurable adder tree are designed to support three types of convolution operations; (2) a dedicated architecture combined with input buffers is designed for the bottleneck network structure to reduce data transmission time; (3) a hardware padding scheme to eliminate invalid padding operations is proposed; (4) a hardware-assisted pruning method is developed to support online trade-off between model accuracy and power consumption. Experimental results show that for MobileNetV2 the accelerator achieves 10x and 6x energy efficiency improvement over the CPU and GPU implementation, and 302.3 FPS and 181.8 GOPS performance which is the best among several existing single-engine accelerators on FPGAs. The proposed hardware-assisted pruning method can effectively reduce 59.7% power consumption at the accuracy loss within 5%.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135736422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, Dongdong Xu, Zhuohuan Liu, Mengke Liu, Xiaoyang Yan, Hong Wang, Rongzhang Zheng, Li Wang, Dong Li, Satyaprakash Pareek, Jian Weng, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan
{"title":"XVDPU: A High Performance CNN Accelerator on Versal Platform Powered by AI Engine","authors":"Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, Dongdong Xu, Zhuohuan Liu, Mengke Liu, Xiaoyang Yan, Hong Wang, Rongzhang Zheng, Li Wang, Dong Li, Satyaprakash Pareek, Jian Weng, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan","doi":"10.1145/3617836","DOIUrl":"https://doi.org/10.1145/3617836","url":null,"abstract":"Nowadays, convolution neural networks (CNNs) are widely used in computer vision applications. However, the trends of higher accuracy and higher resolution generate larger networks. The requirements of computation or I/O are the key bottlenecks. In this paper, we propose XVDPU: the AI-Engine (AIE)-based CNN accelerator on Versal chips to meet heavy computation requirements. To resolve IO bottleneck, we adopt several techniques to improve data-reuse and reduce I/O requirements. An Arithmetic Logic Unit (ALU) is further proposed which can better balance resource utilization, new feature support, and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core implementation can achieve 1653 frames per second (FPS) for ResNet50 on VCK190, which is 9.8 × faster than the design on ZCU102 running at 168.5 FPS. The 256-AIE-core implementation can further achieve 4050 FPS. We propose a tilling strategy to achieve feature-map-stationary (FMS) for high-definition CNN (HD-CNN) with the accelerator, achieving 3.8 × FPS improvement on Residual Channel Attention Network (RCAN) and 3.1 × on Super-Efficient Super-Resolution (SESR). This accelerator can also solve the 3D convolution task in disparity estimation, achieving end-to-end (E2E) performance of 10.1FPS with all the optimizations.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135736217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CHIP-KNNv2: A <u>C</u> onfigurable and <u>Hi</u> gh- <u>P</u> erformance <u>K</u> - <u>N</u> earest <u>N</u> eighbors Accelerator on HBM-based FPGAs","authors":"Kenneth Liu, Alec Lu, Kartik Samtani, Zhenman Fang, Licheng Guo","doi":"10.1145/3616873","DOIUrl":"https://doi.org/10.1145/3616873","url":null,"abstract":"The k-nearest neighbors (KNN) algorithm is an essential algorithm in many applications, such as similarity search, image classification, and database query. With the rapid growth in the dataset size and the feature dimension of each data point, processing KNN becomes more compute and memory hungry. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resource on FPGAs. However, they often overlook the memory access optimizations on FPGA platforms and only achieve a marginal speedup over a multi-thread CPU implementation for large datasets. In this paper, we design and implement CHIP-KNN: an HLS-based, configurable, and high-performance KNN accelerator. CHIP-KNN optimizes the off-chip memory access on modern HBM-based FPGAs such as the AMD/Xilinx Alveo U280 FPGA board. CHIP-KNN is configurable for all essential parameters used in the algorithm, including the size of the search dataset, the feature dimension and data type representation of each data point, the distance metric, and the number of nearest neighbors - K. In terms of design architecture, we explore and discuss the trade-offs between two design versions: CHIP-KNNv1 (Ping-Pong buffer based) and CHIP-KNNv2 (streaming-based). Moreover, we investigate the routing congestion issue in our accelerator design, implement hierarchical structures to shorten critical paths, and integrate an open-source floorplanning optimization tool called TAPA/AutoBridge to eliminate the place-and-route issues. To explore the design space and balance the computation and memory access performance, we also build an analytical performance model. Given a user configuration of the KNN parameters, our tool can automatically generate TAPA HLS C code for the optimal accelerator design and the corresponding host code, on the HBM-based FPGA platform. Our experimental results on the Alveo U280 show that, compared to a 48-thread CPU implementation, CHIP-KNNv2 achieves a geomean performance speedup of 15x, with a maximum speedup of 45x. Additionally, we show that CHIP-KNNv2 achieves up to 2.1x performance speedup over CHIP-KNNv1 while increasing configurability. Compared with the state-of-the-art Facebook AI Similarity Search (FAISS) [23] GPU implementation running on a Nvidia Tesla V100 GPU, CHIP-KNNv2 achieves an average latency reduction of 30.6x while requiring 34.3% of GPU power consumption.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135735617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hardware Accelerator for the Semi-Global Matching Stereo Algorithm","authors":"J. Kalomiros, J. Vourvoulakis, S. Vologiannidis","doi":"10.1145/3615869","DOIUrl":"https://doi.org/10.1145/3615869","url":null,"abstract":"The semi-global matching stereo algorithm is a top performing algorithm in stereo vision. The recursive nature of the computations involved in this algorithm introduces an inherent data dependency problem, hindering the progressive computations of disparities at pixel clock. In this work, a novel hardware implementation of the semi-global matching algorithm is presented. A hardware structure of parallel comparators is proposed for the fast computation of the minima among large cost arrays in one clock cycle. Also, a hardware-friendly algorithm is proposed for the computation of the minima among far-indexed disparities, shortening the length of computations in the datapath. As a result, the recursive path cost computation is accelerated considerably. The system is implemented in a Stratix V device and in a Zynq UltraScale+ device. A throughput of 55,1 million disparities per second is achieved with maximum disparity 128 pixels and frame resolution 1280 × 720. The proposed architecture is less elaborate and more resource efficient than other systems in the literature and its performance compares favorably to them. An implementation on an actual FPGA board is also presented and serves as a real-world verification of the proposed system.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48268611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-based Deep Learning Inference Accelerators: Where Are We Standing?","authors":"Anouar Nechi, Lukas Groth, Saleh Mulhem, Farhad Merchant, R. Buchty, Mladen Berekovic","doi":"10.1145/3613963","DOIUrl":"https://doi.org/10.1145/3613963","url":null,"abstract":"Recently, artificial intelligence applications have become part of almost all emerging technologies around us. Neural networks, in particular, have shown significant advantages and have been widely adopted over other approaches in machine learning. In this context, high processing power is deemed a fundamental challenge and a persistent requirement. Recent solutions facing such a challenge deploy hardware platforms to provide high computing performance for neural networks and deep learning algorithms. This direction is also rapidly taking over the market. Here, FPGAs occupy the middle ground regarding flexibility, reconfigurability, and efficiency compared to general-purpose CPUs, GPUs, on one side, and manufactured ASICs on the other. FPGA-based accelerators exploit the features of FPGAs to increase the computing performance for specific algorithms and algorithm features. Filling a gap, we provide holistic benchmarking criteria and optimization techniques that work across several classes of deep learning implementations. This paper summarizes the current state of deep learning hardware acceleration: More than 120 FPGA-based neural network accelerator designs are presented and evaluated based on a matrix of performance and acceleration criteria, and corresponding optimization techniques are presented and discussed. In addition, the evaluation criteria and optimization techniques are demonstrated by benchmarking ResNet-2 and LSTM-based accelerators.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44121298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BLOOP: Boolean Satisfiability-based Optimized Loop Pipelining","authors":"Nicolai Fiege, P. Zipf","doi":"10.1145/3599972","DOIUrl":"https://doi.org/10.1145/3599972","url":null,"abstract":"Modulo scheduling is the premier technique for throughput maximization of loops in high-level synthesis by interleaving consecutive loop iterations. The number of clock cycles between data insertions is called the initiation interval (II). For throughput maximization, this value should be as low as possible; therefore, its minimization is the main optimization goal. Despite its long historical existence, modulo scheduling always remained a relevant research topic over the years with many exact and heuristic algorithms available in the literature. Nevertheless, we are able to leverage the scalability of modern Boolean Satisfiability (SAT) solvers to outperform state-of-the-art ILP-based algorithms for latency-optimal modulo scheduling for both integer and rational IIs. Our algorithm is able to compute valid modulo schedules for the whole CHStone and MachSuite benchmark suites, with 99% of the solutions being proven to be throughput optimal for a timeout of only 10 minutes per candidate II. For various time limits, not a single tested scheduler from the state of the art is able to compute more verified optimal solutions or even a single schedule with a higher throughput than our proposed approach. Using an HLS toolflow, we show that our algorithm can be effectively used to generate Pareto-optimal FPGA implementations regarding throughput and resource usage.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 32"},"PeriodicalIF":2.3,"publicationDate":"2023-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43029030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topgun: An ECC Accelerator for Private Set Intersection","authors":"Guiming Wu, Qianwen He, Jiali Jiang, Zhenxiang Zhang, Yuan Zhao, Yinchao Zou, Jie Zhang, Changzheng Wei, Ying Yan, Hui Zhang","doi":"https://dl.acm.org/doi/10.1145/3603114","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3603114","url":null,"abstract":"<p>Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, etc. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High performance implementation of ECC is required, especially for the DH-PSI protocol used in privacy-preserving platform. </p><p>Point multiplication, the chief cryptographic primitive in ECC, is computationally expensive. To improve the performance of DH-PSI protocol, we propose Topgun, a novel and high-performance hardware architecture for point multiplication over Curve25519. The proposed architecture features a pipelined Finite-field Arithmetic Unit and a simple and highly efficient instruction set architecture. Compared to the best existing work on Xilinx Zynq 7000 series FPGA, our implementation with one Processing Element can achieve 3.14 × speedup on the same device. To the best of our knowledge, our implementation appears to be the fastest among the state-of-the-art works. We also have implemented our architecture consisting of 4 Compute Groups, each with 16 PEs, on an Intel Agilex AGF027 FPGA. The measured performance of 4.48 Mops/s is achieved at the cost of 86 Watts power, which is the record-setting performance for point multiplication over Curve25519 on FPGAs.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"80 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138504979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topgun: An ECC Accelerator for Private Set Intersection","authors":"Guiming Wu, Qianwen He, Jiali Jiang, Zhenxiang Zhang, Yuan Zhao, Yinchao Zou, Jie Zhang, Changzheng Wei, Ying Yan, Hui Zhang","doi":"10.1145/3603114","DOIUrl":"https://doi.org/10.1145/3603114","url":null,"abstract":"Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, etc. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High performance implementation of ECC is required, especially for the DH-PSI protocol used in privacy-preserving platform. Point multiplication, the chief cryptographic primitive in ECC, is computationally expensive. To improve the performance of DH-PSI protocol, we propose Topgun, a novel and high-performance hardware architecture for point multiplication over Curve25519. The proposed architecture features a pipelined Finite-field Arithmetic Unit and a simple and highly efficient instruction set architecture. Compared to the best existing work on Xilinx Zynq 7000 series FPGA, our implementation with one Processing Element can achieve 3.14 × speedup on the same device. To the best of our knowledge, our implementation appears to be the fastest among the state-of-the-art works. We also have implemented our architecture consisting of 4 Compute Groups, each with 16 PEs, on an Intel Agilex AGF027 FPGA. The measured performance of 4.48 Mops/s is achieved at the cost of 86 Watts power, which is the record-setting performance for point multiplication over Curve25519 on FPGAs.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44420898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NAPOLY: A Non-deterministic Automata Processor OverLaY","authors":"Rasha Karakchi, Jason D. Bakos","doi":"https://dl.acm.org/doi/10.1145/3593586","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3593586","url":null,"abstract":"<p>Deterministic and Non-deterministic Finite Automata (DFA and NFA) comprise the core of many big data applications. Recent efforts to develop Domain-Specific Architectures (DSAs) for DFA/NFA have taken divergent approaches, but achieving consistent throughput for arbitrarily-large pattern sets, state activation rates, and pattern match rates remains a challenge. In this article, we present NAPOLY (Non-Deterministic Automata Processor OverLaY), an FPGA overlay and associated compiler. A common limitation of prior efforts is a limit on NFA size for achieving the advertised throughput. NAPOLY is optimized for fast re-programming to permit practical time-division multiplexing of the hardware and permit high asymptotic throughput for NFAs of unlimited size, unlimited state activation rate, and high pattern reporting rate. NAPOLY also allows for offline generation of configurations having tradeoffs between state capacity and transition capacity. In this article, we (1) evaluate NAPOLY using benchmarks packaged in the ANMLZoo benchmark suite, (2) evaluate the use of an SAT solver for allocating physical resources, and (3) compare NAPOLY’s performance against existing solutions. NAPOLY performs most favorably on larger benchmarks, benchmarks with higher state activation frequency, and benchmarks with higher reporting frequency. NAPOLY outperforms the fastest of the CPU and GPU implementations in 10 out of 12 benchmarks.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"86 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carol Jingyi Li, Xiangwei Li, Binglei Lou, Craig T. Jin, David Boland, Philip H. W. Leong
{"title":"Fixed-point FPGA Implementation of the FFT Accumulation Method for Real-time Cyclostationary Analysis","authors":"Carol Jingyi Li, Xiangwei Li, Binglei Lou, Craig T. Jin, David Boland, Philip H. W. Leong","doi":"https://dl.acm.org/doi/10.1145/3567429","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3567429","url":null,"abstract":"<p>The spectral correlation density (SCD) is an important tool in cyclostationary signal detection and classification. Even using efficient techniques based on the fast Fourier transform (FFT), real-time implementations are challenging because of the high computational complexity. A key dimension for computational optimization lies in minimizing the wordlength employed. In this article, we analyze the relationship between wordlength and signal-to-quantization noise in fixed-point implementations of the SCD function. A canonical SCD estimation algorithm, the FFT accumulation method (FAM) using fixed-point arithmetic, is studied. We derive closed-form expressions for SQNR and compare them at wordlengths ranging from 14 to 26 bits. The differences between the calculated SQNR and bit-exact simulations are less than 1 dB. Furthermore, an HLS-based FPGA design is implemented on a Xilinx Zynq UltraScale+ XCZU28DR-2FFVG1517E RFSoC. Using less than 25% of the logic fabric on the device, it consumes 7.7 W total on-chip power and has a power efficiency of 12.4 GOPS/W, which is an order of magnitude improvement over an Nvidia Tesla K40 graphics processing unit (GPU) implementation. In terms of throughput, it achieves 50 MS/sec, which is a speedup of 1.6 over a recent optimized FPGA implementation.</p>","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"78 2","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138505000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}