{"title":"Accurate and Efficient Hyperbolic Tangent Activation Function on FPGA using the DCT Interpolation Filter (Abstract Only)","authors":"A. Abdelsalam, J. Langlois, F. Cheriet","doi":"10.1145/3020078.3021768","DOIUrl":"https://doi.org/10.1145/3020078.3021768","url":null,"abstract":"Implementing an accurate and fast activation function at low cost is crucial to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high-accuracy approximation approach for the hyperbolic tangent activation function of artificial neurons in DNNs. It is based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed interpolation architecture combines simple arithmetic operations on the stored samples of the hyperbolic tangent function and on input data. The proposed implementation outperforms the existing implementations in terms of accuracy while using the same or fewer computational and memory resources. The proposed architecture can approximate the hyperbolic tangent activation function with 2×10⁻⁴ maximum error while requiring only 1.12 Kbits of memory and 21 LUTs of a Virtex-7 FPGA.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130302875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Storage-Efficient Batching for Minimizing Bandwidth of Fully-Connected Neural Network Layers (Abstract Only)","authors":"Yongming Shen, M. Ferdman, Peter Milder","doi":"10.1145/3020078.3021795","DOIUrl":"https://doi.org/10.1145/3020078.3021795","url":null,"abstract":"Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. These networks typically use convolutional layers for feature extraction and fully-connected layers to perform classification using those features. Significant interest in improving the performance of CNNs has led to the design of CNN accelerators to improve their evaluation throughput and efficiency. However, work on CNN accelerators has mostly concentrated on accelerating the computationally-intensive convolutional layers, while a major bottleneck of the existing designs arises due to the data-intensive fully-connected layers. Unfortunately, the leading approaches to reducing bandwidth of the fully-connected layers are limited by the storage capacity of the on-chip buffers. We observe that, in addition to the possibility of reducing CNN weight transfer bandwidth by adding more on-chip buffers, it is also possible to reduce the size of the on-chip buffers at the cost of CNN input transfer. Paradoxically, shrinking the size of the on-chip buffers costs significantly less input bandwidth than the weight bandwidth saved by adding more buffers. Leveraging these observations, we develop a design methodology for fully-connected layer accelerators that require substantially less off-chip bandwidth by balancing between the input and weight transfers. Using 160KB of BRAM enables the prior work to reduce off-chip bandwidth by 5x on the most bandwidth-intensive fully-connected layers of the popular AlexNet and VGGNet-E networks. 
With our newly proposed methodology, using the same 160KB of BRAM produces a design with 71x bandwidth reduction on the same networks.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115872445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGAs in the Cloud","authors":"G. Constantinides","doi":"10.1145/3020078.3030014","DOIUrl":"https://doi.org/10.1145/3020078.3030014","url":null,"abstract":"Ever greater amounts of computing and storage are happening remotely in the cloud, and it is estimated that spending on public cloud services will grow by over 19%/year to $140B in 2019. Besides commodity processors, network and storage infrastructure, the end of clock frequency scaling in traditional processors has meant that application-specific accelerators are required in tandem with cloud-based processors to deliver continued improvements in computational performance and energy efficiency. Indeed, graphics processing units (GPUs), as well as custom ASICs, are now widely used within the cloud, particularly for compute-intensive high-value applications like machine learning. In this panel, we intend to consider the opportunities and challenges for broad deployment of FPGAs in the cloud.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128112868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA Implementation of Non-Uniform DFT for Accelerating Wireless Channel Simulations (Abstract Only)","authors":"Srinivas Siripurapu, Aman Gayasen, P. Gopalakrishnan, N. Chandrachoodan","doi":"10.1145/3020078.3021800","DOIUrl":"https://doi.org/10.1145/3020078.3021800","url":null,"abstract":"FPGAs have been used as accelerators in a wide variety of domains such as learning, search, genomics, signal processing, compression, analytics and so on. In recent years, the availability of tools and flows such as high-level synthesis has made it even easier to accelerate a variety of high-performance computing applications onto FPGAs. In this paper we propose a systematic methodology for optimizing the performance of an accelerated block using the notion of compute intensity to guide optimizations in high-level synthesis. We demonstrate the effectiveness of our methodology on an FPGA implementation of a non-uniform discrete Fourier transform (NUDFT), used to convert a wireless channel model from the time-domain to the frequency domain. The acceleration of this particular computation can be used to improve the performance and capacity of wireless channel simulation, which has wide applications in the system level design and performance evaluation of wireless networks. Our results show that our FPGA implementation outperforms the same code offloaded onto GPUs and CPUs by 1.6x and 10x respectively, in performance as measured by the throughput of the accelerated block. 
The gains in performance per watt versus GPUs and CPUs are 15.6x and 41.5x respectively.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133560705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic-Based Multi-stage Streaming Realization of a Deep Convolutional Neural Network (Abstract Only)","authors":"Mohammed Alawad, Mingjie Lin","doi":"10.1145/3020078.3021788","DOIUrl":"https://doi.org/10.1145/3020078.3021788","url":null,"abstract":"The large-scale convolutional neural network (CNN), conceptually mimicking the operational principle of visual perception in the human brain, has been widely applied to tackle many challenging computer vision and artificial intelligence applications. Unfortunately, despite its simple architecture, a typically sized CNN is well known to be computationally intensive. This work presents a novel stochastic-based and scalable hardware architecture and circuit design that computes a large-scale CNN on an FPGA. The key idea is to implement all key components of a deep learning CNN, including multi-dimensional convolution, activation, and pooling layers, completely in the probabilistic computing domain in order to achieve high computing robustness, high performance, and low hardware usage. Most importantly, through both theoretical analysis and FPGA hardware implementation, we demonstrate that a stochastic-based deep CNN can achieve superior hardware scalability when compared with its conventional deterministic-based FPGA implementation by allowing a stream computing mode and adopting efficient random sample manipulations. 
Overall, being highly scalable and energy efficient, our stochastic-based convolutional neural network architecture is well-suited for a modular vision engine with the goal of performing real-time detection, recognition and segmentation of mega-pixel images, especially those perception-based computing tasks that are inherently fault-tolerant, while still requiring high energy efficiency.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131456256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: CAD Tools","authors":"Lesley Shannon","doi":"10.1145/3257187","DOIUrl":"https://doi.org/10.1145/3257187","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":" 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113949246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network","authors":"Jialiang Zhang, J. Li","doi":"10.1145/3020078.3021698","DOIUrl":"https://doi.org/10.1145/3020078.3021698","url":null,"abstract":"OpenCL FPGA has recently gained great popularity with emerging needs for workload acceleration such as Convolutional Neural Network (CNN), which is the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGA, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources in FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out to be insufficient to achieve desirable performance for both compute-intensive and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis of the resource requirements of CNN classifier kernels and the available resources on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design to effectively address such bandwidth limitation and to provide an optimal balance between computation, on-chip, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating point performance at 370MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385MHz. 
To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115281475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA Overlay Architecture for Cost Effective Regular Expression Search (Abstract Only)","authors":"Thomas Luinaud, Y. Savaria, J. Langlois","doi":"10.1145/3020078.3021770","DOIUrl":"https://doi.org/10.1145/3020078.3021770","url":null,"abstract":"Snort and Bro are Deep Packet Inspection systems which express complex rules with regular expressions. Before performing a regular expression search, these applications apply a filter to select which regular expressions must be searched. One way to search a regular expression is through a Nondeterministic Finite Automaton (NFA). Traversing an NFA is very time consuming on a sequential machine like a CPU. One solution is to implement the NFA in hardware. Since FPGAs are reconfigurable and massively parallel, they are a good solution. Moreover, with the advent of platforms combining FPGAs and CPUs, implementing accelerators in FPGAs becomes very attractive. Even though FPGAs are reconfigurable, the reconfiguration time can be too long in some cases. This paper thus proposes an overlay architecture that can efficiently find matches for regular expressions. The architecture contains multiple contexts that allow fast reconfiguration. Based on the results of a string filter, a context is selected and regular expression search is performed. The proposed design can support all rules from a set such as Snort while significantly reducing compute resources and allowing fast context updates. An example architecture was implemented on a Xilinx® xc7a200 Artix-7. 
It achieves a throughput of 100 million characters per second, requires 20 ns for a context switch, and occupies 9% of the slices and 85% of the BRAM resources of the FPGA.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"140 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116275014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs","authors":"Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, M. Srivastava, Rajesh K. Gupta, Zhiru Zhang","doi":"10.1145/3020078.3021741","DOIUrl":"https://doi.org/10.1145/3020078.3021741","url":null,"abstract":"Convolutional neural networks (CNNs) are the current state-of-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Studies into the FPGA acceleration of CNN workloads have achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks gives GPUs a significant advantage in programmability. Recent research in machine learning demonstrates the potential of very low precision CNNs -- i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. 
The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129833380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Mixed-Signal Data-Centric Reconfigurable Architecture enabled by RRAM Technology (Abstract Only)","authors":"Yue Zha, Jialiang Zhang, Zhiqiang Wei, J. Li","doi":"10.1145/3020078.3021759","DOIUrl":"https://doi.org/10.1145/3020078.3021759","url":null,"abstract":"This poster presents a data-centric reconfigurable architecture, which is enabled by emerging non-volatile memory, i.e., RRAM. Compared to the heterogeneous architecture of commercial FPGAs, it is inherently a homogeneous architecture comprising a two-dimensional (2D) array of mixed-signal processing \"tiles\". Each tile can be configured into one or a combination of the four modes: logic, memory, TCAM, and interconnect. Computation within a tile is performed in the analog domain for energy efficiency, whereas communication between tiles is performed in the digital domain for resilience. Such flexibility allows users to partition resources based on applications' needs, in contrast to fixed hardware design using dedicated hard IP blocks in FPGAs. In addition to better resource usage, its \"memory friendly\" architecture effectively addresses the limitations of commercial FPGAs, i.e., scarce on-chip memory resources, making it an effective complement to FPGAs. Moreover, its coarse-grained configuration results in shallower logic depth, less inter-tile routing overhead, and thus smaller area and better performance, compared with its FPGA counterpart. 
Our preliminary study shows great promise of this architecture for improving performance, energy efficiency and security.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126507626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}