The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Publications

NetCracker: A Peek into the Routing Architecture of Xilinx 7-Series FPGAs
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439285
Morten B. Petersen, Stefan Nikolic, Mirjana Stojilović
Abstract: Novel applications have triggered significant changes at the system level of FPGA architecture design, such as the introduction of embedded VLIW processor arrays and hardened NoCs. However, the routing architecture of the soft logic fabric has largely remained unchanged in recent years. Since the hunger for accelerating ever more varied tasks under various power budgets, as well as complications related to technology scaling, is likely to remain significant, it is foreseeable that the routing architecture too will have to evolve. In this work, we do not try to suggest what the routing architectures of tomorrow should look like. Instead, we analyze an existing architecture from a popular commercial FPGA family, discussing the possible origins of various design decisions and pointing out aspects that may merit future research. Moreover, we present an open-source tool that greatly eases such analyses, relying only on data readily available from the vendor CAD tools. Our hope is that this work will help the academic research community catch up with current developments in industry and accelerate its contributions to the FPGA architectures of the future.
Citations: 10
Tensor Slices to the Rescue: Supercharging ML Acceleration on FPGAs
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439282
Aman Arora, Samidh Mehta, Vaughn Betz, L. John
Abstract: FPGAs are well suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures, and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed, including adding specialized artificial intelligence (AI) processing engines, support for IEEE half-precision (fp16) math in DSP slices, and hard matrix multiplier blocks. In this paper, we describe replacing a small percentage of the FPGA's programmable logic area with Tensor Slices. At their heart, these slices are arrays of processing elements that support multiple tensor operations and multiple dynamically selectable precisions, and that can be dynamically fractured into individual adders, multipliers, and MACs (multiply-and-accumulate units). The tiles have a local crossbar at the inputs that helps ease the routing pressure caused by a large slice. By spending ~3% of the FPGA's area on Tensor Slices, we observe an average frequency increase of 2.45x and an average area reduction of 0.41x across several ML benchmarks, including a TPU-like design, compared to an Intel Agilex-like baseline FPGA. We also study the impact of spending area on Tensor Slices on non-ML applications, observing an average reduction of 1% in frequency and an average increase of 1% in routing wirelength compared to the baseline across the non-ML benchmarks we studied. Adding these ML-specific coarse-grained hard blocks makes the proposed FPGA a much more efficient hardware accelerator for ML applications, while still keeping the vast majority of the FPGA's real estate programmable at fine grain.
Citations: 15
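The fracturable, multi-precision MAC arrays described in the abstract above can be made concrete with a small behavioral model. This is a hypothetical Python sketch, not the paper's hardware: the names `quantize` and `fractured_mac` are invented, and the model only illustrates the idea that one datapath can accumulate products at a dynamically selected precision.

```python
def quantize(x, bits):
    """Clamp a signed integer into the two's-complement range of `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))

def fractured_mac(a_vec, b_vec, bits, acc=0):
    """Behavioral model of a precision-fracturable MAC: the same datapath
    accumulates element-wise products at the selected precision (e.g. 8 or 4
    bits), mimicking a slice fractured into narrower multipliers."""
    for a, b in zip(a_vec, b_vec):
        acc += quantize(a, bits) * quantize(b, bits)
    return acc
```

In a real slice the precision select would reconfigure the multiplier array itself; here it simply changes the clamping range.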
Stratix 10 NX Architecture and Applications
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439293
M. Langhammer, E. Nurvitadhi, B. Pasca, Sergey Gribok
Abstract: The advent of AI has driven the adoption of high-density, low-precision arithmetic on FPGAs. This has resulted in new methods for mapping both arithmetic functions and dataflows onto the fabric, as well as some changes to the embedded DSP blocks. Technologies outside the FPGA realm have also evolved, such as the addition of tensor structures to GPUs and the introduction of numerous AI ASSPs, all of which claim higher performance and efficiency than current FPGAs. In this paper we introduce the Stratix 10 NX device (NX), a variant of FPGA specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft logic fabric, a new type of DSP block provides the dense arrays of low-precision multipliers typically used in AI implementations. The architecture of the block is tuned for the common matrix-matrix and vector-matrix multiplications in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared-exponent support for block floating point FP16 and FP12 numerics. All additions/accumulations can be done in INT32 or IEEE 754 single-precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We also describe methods by which the smaller-precision multipliers can be aggregated to create larger multipliers that are more applicable to standard signal processing requirements. In terms of overall compute throughput, Stratix 10 NX achieves 143 INT8/FP16 TOPs/FLOPs, or 286 INT4/FP12 TOPs/FLOPs, at 600 MHz. Depending on the configuration, power efficiency is in the range of 1-4 TOPs/W or TFLOPs/W.
Citations: 23
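The shared-exponent block floating point numerics mentioned in the abstract above (one exponent per block of values, an integer mantissa per element, so a dot product reduces to integer MACs plus a single scale) can be sketched as follows. This is an illustrative model under assumed rounding and clamping choices, not Intel's actual encoding; `to_block_fp` and `bfp_dot` are hypothetical names.

```python
import math

def to_block_fp(vec, mant_bits=8):
    """Encode a float vector in shared-exponent block floating point:
    one exponent for the whole block, a signed integer mantissa per element."""
    max_abs = max(abs(v) for v in vec) or 1.0
    exp = math.floor(math.log2(max_abs)) + 1 - (mant_bits - 1)
    lo, hi = -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1
    mants = [min(hi, max(lo, round(v / 2.0 ** exp))) for v in vec]
    return mants, exp

def bfp_dot(a, b, mant_bits=8):
    """Dot product of two block-fp vectors: pure integer MACs (cf. the INT32
    accumulators in the DSP block), then one final power-of-two scale."""
    ma, ea = to_block_fp(a, mant_bits)
    mb, eb = to_block_fp(b, mant_bits)
    acc = sum(x * y for x, y in zip(ma, mb))   # integer accumulate
    return acc * 2.0 ** (ea + eb)
```

The design point this illustrates: because the exponent is shared, the per-element multipliers stay as cheap integer multipliers, which is why block FP16 can reuse the INT8 arrays.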
3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems in the Cloud
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439480
Shulin Zeng, Guohao Dai, Hanbo Sun, Jun Liu, Hongren Zheng, Yusong Wu, Fan Zhang, Xinhao Yang, Yi Cai, Yu Wang, Huazhong Yang
Abstract: With the ever-growing demand for online Artificial Intelligence (AI), hardware virtualization support for deep learning accelerators is vital for providing AI capability in the cloud. Three basic features, multi-task support, dynamic workloads, and remote access, are fundamental for hardware virtualization. However, most deep learning accelerators do not support concurrent execution of multiple tasks. Moreover, state-of-the-art multi-DNN scheduling algorithms for NN accelerators consider neither multi-task concurrent execution nor resource allocation for multi-core DNN accelerators. Finally, existing GPU virtualization solutions can introduce a huge remote-access latency overhead, resulting in a severe system performance drop. To tackle these challenges, we propose 3M-AI, a Multi-task and Multi-core virtualization framework for Multi-FPGA AI systems in the cloud. 3M-AI enables model parallelism on multiple FPGAs by optimizing data synchronization and movement between FPGAs. 3M-AI exploits a heuristic hardware resource allocation algorithm and an accurate multi-core latency prediction model. 3M-AI significantly reduces the remote API access overhead to nearly 1%, and achieves better NN inference latency at batch size 1 compared with GPU virtualization solutions.
Citations: 2
Scientific Applications of FPGAs at the LHC
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3437119
P. Harris
Abstract: The next generation of high-throughput data acquisition systems is capable of acquiring data at rates far exceeding our ability to save it. To process data in real time, specialized computing systems with incredibly high throughput are needed so that data can be quickly assessed to determine whether it is sufficiently interesting for further processing. With a raw data rate exceeding 1 petabit per second, particle detectors at the Large Hadron Collider at the European Organization for Nuclear Research (CERN) contend with some of the largest data rates ever encountered. With planned upgrades in the near future, these rates will continue to grow, further complicating our ability to process data effectively so that we can continue to understand the fundamental properties of the universe. In this talk, we present the current FPGA-based LHC data acquisition system and discuss the plenitude of data challenges currently being addressed. We describe various aspects of the system and present deep-learning-based solutions that are quickly being adopted at the LHC. We also discuss the lower-throughput, computationally complex systems and how FPGAs can augment them, leading to enhanced physics performance. Throughout the talk, we discuss the scientific implications possible with an improved system. Finally, we discuss related problems in other scientific fields, including astrophysics and materials science, and present new challenges that, if solved, can open paths to new avenues of fundamental scientific research.
Citations: 0
CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439295
Zhen Dong, Dequan Wang, Qijing Huang, Yizhao Gao, Yaohui Cai, Tian Li, Bichen Wu, K. Keutzer, J. Wawrzynek
Abstract: Deploying deep learning models on embedded systems for computer vision tasks has been challenging due to limited compute resources and strict energy budgets. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, such as object detection, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects and therefore require specialized convolutions to aggregate spatial information. To address this need, recent work introduces dynamic deformable convolution to augment regular convolutions. Regular convolutions process a fixed grid of pixels across all spatial locations in an image, while dynamic deformable convolution may access arbitrary pixels in the image, with the access pattern being input-dependent and varying with spatial location. These properties lead to inefficient memory accesses on existing hardware. In this work, we harness the flexibility of FPGAs to develop a novel object detection pipeline with deformable convolutions. We show the speed-accuracy tradeoffs for a set of algorithm modifications, including irregular-access versus limited-range and fixed-shape variants, on a flexible hardware accelerator. We evaluate these algorithmic changes with corresponding hardware optimizations and show a 1.36x and 9.76x speedup, respectively, for the full and depthwise deformable convolution on hardware, with minor accuracy loss. We then co-design a network called CoDeNet with the modified deformable convolution for object detection and quantize the network to 4-bit weights and 8-bit activations. With our high-efficiency implementation, our solution reaches 26.9 frames per second with a tiny model size of 0.76 MB while achieving 61.7 AP50 on the standard object detection dataset, Pascal VOC. With our higher-accuracy implementation, our model reaches 67.1 AP50 on Pascal VOC with only 2.9 MB of parameters: 20.9x smaller but 10% more accurate than Tiny-YOLO.
Citations: 37
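The irregular, input-dependent access pattern of deformable convolution described in the abstract above can be made concrete with a small sketch. This is a hypothetical illustration (the name `deform_gather` is invented, and real implementations use bilinear interpolation for fractional offsets rather than integer offsets); it only shows why each output position may touch arbitrary input pixels, which is what stresses memory systems on hardware.

```python
def deform_gather(feature, center, offsets):
    """Gather the input pixels for one output position of a deformable conv:
    each kernel tap reads from a learned, input-dependent offset relative to
    its fixed-grid location, instead of the grid location itself."""
    h, w = len(feature), len(feature[0])
    cy, cx = center
    taps = []
    for (ky, kx), (dy, dx) in offsets.items():
        y = min(max(cy + ky + dy, 0), h - 1)   # clamp reads at the borders
        x = min(max(cx + kx + dx, 0), w - 1)
        taps.append(feature[y][x])
    return taps
```

A "limited-range" variant, as explored in the paper, would additionally clamp `dy`/`dx` to a small window so the hardware can keep a bounded line buffer.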
LEAP: A Deep Learning based Aging-Aware Architecture Exploration Framework for FPGAs
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439459
B. Ghavami, Seyed Milad Ebrahimi, Zhenman Fang, Lesley Shannon
Abstract: Transistor aging raises a vital lifetime reliability challenge for FPGA devices in advanced technology nodes. In this paper, we design a tool called LEAP to enable aging-aware FPGA architecture exploration. The core idea of LEAP is to efficiently model aging-induced delay degradation at the coarse-grained FPGA basic block level using deep neural networks (DNNs), while achieving almost the same accuracy as transistor-level simulation. For each type of FPGA basic block, such as the LUT and DSP, we first characterize its delay degradation accurately via transistor-level SPICE simulation under a versatile set of aging factors from the FPGA fabric and in-field operation. We then train one DNN model per block type to learn the relation between its delay degradation and the aging factors. Moreover, we integrate our DNN models into the widely used Verilog-to-Routing (VTR 8) toolflow and generate the aging-aware FPGA architecture file. Experimental results demonstrate that our proposed flow can predict the delay degradation of FPGA blocks 10^4x to 10^7x faster than transistor-level SPICE simulation, with a maximum prediction error of less than 0.7%. FPGA architects can therefore leverage LEAP to explore better aging-aware FPGA architectures.
Citations: 0
ThunderGP: HLS-based Graph Processing Framework on FPGAs
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439290
Xinyu Chen, Hongshi Tan, Yao Chen, Bingsheng He, W. Wong, Deming Chen
Abstract: FPGAs have become an emerging computing infrastructure in datacenters, benefiting from fine-grained parallelism, energy efficiency, and reconfigurability. Meanwhile, graph processing has attracted tremendous interest in data analytics, and its performance is in increasing demand with the rapid growth of data. Many works have been proposed to tackle the challenges of designing efficient FPGA-based accelerators for graph processing. However, the largely overlooked issue of programmability still requires hardware design expertise and sizable development effort from developers. To close this gap, we propose ThunderGP, an open-source HLS-based graph processing framework on FPGAs, with which developers can enjoy the performance of FPGA-accelerated graph processing by writing only a few high-level functions, with no knowledge of the hardware. ThunderGP adopts the Gather-Apply-Scatter (GAS) model as the abstraction for various graph algorithms and realizes the model with a built-in, highly parallel, and memory-efficient accelerator template. With the high-level functions as inputs, ThunderGP automatically explores the massive resources and memory bandwidth of multiple Super Logic Regions (SLRs) on FPGAs to generate the accelerator, then deploys it and schedules its tasks. We evaluate ThunderGP with seven common graph applications. The results show that accelerators on real hardware platforms deliver a 2.9x speedup over the state-of-the-art approach, running at 250 MHz and achieving throughput of up to 6,400 MTEPS (million traversed edges per second). We also conduct a case study with ThunderGP, which delivers up to a 419x speedup over a CPU-based design with significantly reduced development effort. This work is open-sourced on GitHub at https://github.com/Xtra-Computing/ThunderGP.
Citations: 53
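The Gather-Apply-Scatter abstraction that ThunderGP adopts can be sketched in a few lines: the developer supplies three small functions and the framework owns edge traversal, buffering, and parallelism. A hypothetical Python model (the real framework takes C++ HLS functions; `gas_superstep` and its parameter names are illustrative):

```python
def gas_superstep(vertices, edges, scatter, gather, apply_fn):
    """One superstep of the Gather-Apply-Scatter (GAS) model:
    scatter a value along every edge, gather (reduce) per destination
    vertex, then apply the combined update to each vertex property."""
    accum = {v: None for v in vertices}
    for src, dst in edges:
        msg = scatter(vertices[src])          # edge-parallel in hardware
        accum[dst] = gather(accum[dst], msg)  # per-destination reduction
    return {v: apply_fn(vertices[v], accum[v]) for v in vertices}
```

For example, counting in-degrees needs only `scatter = lambda s: 1`, `gather = lambda a, m: (a or 0) + m`, and `apply_fn = lambda old, acc: acc or 0`; swapping these three functions yields PageRank-style or BFS-style algorithms over the same traversal machinery.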
Modeling FPGA-Based Systems via Few-Shot Learning
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2021-02-17. DOI: 10.1145/3431920.3439460
Gagandeep Singh, Dionysios Diamantopolous, Juan Gómez-Luna, S. Stuijk, O. Mutlu, H. Corporaal
Abstract: Machine-learning-based models have recently gained traction as a way to overcome the slow downstream implementation process of FPGAs by providing fast and accurate performance predictions. However, these models suffer from two main limitations: (1) a model trained for a specific environment cannot predict for a new, unknown environment; and (2) training requires large amounts of data (features extracted from FPGA synthesis and implementation reports), which is cost-inefficient because of the time-consuming FPGA design cycle. In various settings (e.g., cloud systems), where access to platforms is typically costly, error-prone, and sometimes infeasible, collecting enough data is even more difficult. Our research aims to answer the following question: for an FPGA-based system, can we leverage and transfer ML-based performance models trained on a low-end local system to a new, unknown, high-end FPGA-based system, thereby avoiding the two main limitations of traditional ML-based approaches? To this end, we propose a transfer-learning-based approach for FPGA-based systems that adapts an existing ML-based model to a new, unknown environment to provide fast and accurate performance and resource utilization predictions.
Citations: 4
FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations
The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2020-12-22. DOI: 10.1145/3431920.3439296
Yichi Zhang, Junhao Pan, Xinheng Liu, Hongzheng Chen, Deming Chen, Zhiru Zhang
Abstract: Binary neural networks (BNNs) have 1-bit weights and activations. Such networks are well suited for FPGAs, as their dominant computations are bitwise arithmetic and their memory requirements are significantly reduced. However, compared to state-of-the-art compact convolutional neural network (CNN) models, BNNs tend to produce much lower accuracy on realistic datasets such as ImageNet. In addition, the input layer of BNNs has gradually become a major compute bottleneck, because it is conventionally excluded from binarization to avoid a large accuracy loss. This work proposes FracBNN, which exploits fractional activations to substantially improve the accuracy of BNNs. Specifically, our approach employs a dual-precision activation scheme to compute features with up to two bits, using an additional sparse binary convolution. We further binarize the input layer using a novel thermometer encoding. Overall, FracBNN preserves the key benefits of conventional BNNs, where all convolutional layers are computed in pure binary MAC operations (BMACs). We design an efficient FPGA-based accelerator for our novel BNN model that supports the fractional activations. To evaluate the performance of FracBNN in a resource-constrained scenario, we implement the entire optimized network architecture on an embedded FPGA (Xilinx Ultra96 v2). Our experiments on ImageNet show that FracBNN achieves accuracy comparable to MobileNetV2, surpassing the best-known BNN design on FPGAs with an increase of 28.9% in top-1 accuracy and a 2.5x reduction in model size. FracBNN also outperforms a recently introduced BNN model with a 2.4% increase in top-1 accuracy at the same model size. On the embedded FPGA device, FracBNN demonstrates real-time image classification.
Citations: 51
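The binary MAC (BMAC) operations that FracBNN's convolutional layers reduce to can be illustrated with the standard XNOR-popcount trick: with +1/-1 values encoded as single bits (1 -> +1, 0 -> -1), a dot product becomes an XNOR followed by a population count, which is why BNNs map so cheaply onto LUT fabric. A minimal sketch (the function name is illustrative, not from the paper):

```python
def binary_dot(a_bits, b_bits, n):
    """Binary MAC as used in BNNs: for n-element vectors over {-1, +1}
    packed as bits (1 -> +1, 0 -> -1), the dot product is
    2 * popcount(XNOR(a, b)) - n, since XNOR marks agreeing positions."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask       # 1 where the signs agree
    return 2 * bin(xnor).count("1") - n    # agreements minus disagreements
```

FracBNN's "fractional" two-bit activations would run this kernel twice, once per bit plane, with the second (sparse) pass contributing a scaled correction.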