{"title":"HGum: Messaging Framework for Hardware Accelerators (Abstact Only)","authors":"Sizhuo Zhang, Hari Angepat, Derek Chiou","doi":"10.1145/2847263.2847289","DOIUrl":"https://doi.org/10.1145/2847263.2847289","url":null,"abstract":"Software messaging frameworks help avoid errors and reduce engineering effort in building distributed systems by (i) providing an interface definition language (IDL) to precisely specify the structure of the message (the message schema) and (ii) automatically generating the serialization and deserialization functions that transform user data structures into binary data for sending across the network and vice versa. Similarly, a hardware-accelerated system that consists of host software and multiple FPGAs, could also benefit from a messaging framework to handle messages both between software and FPGA and also between different FPGAs. The key challenge for a hardware messaging framework is that it must be able to support large messages with complex schema while meeting critical constraints such as clock frequency, area, and throughput. We present HGum, a messaging framework for hardware accelerators that meets all the above requirements. HGum is able to generate high-performance and low-cost hardware logic by employing a novel design that algorithmically parses the message schema to perform serialization and deserialization. Our evaluation of HGum shows that it not only significantly reduces engineering effort but also generates hardware with comparable quality to manual implementation.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128690888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Memory Partitioning for Parallel Data Access via Data Reuse","authors":"Jincheng Su, Fan Yang, Xuan Zeng, Dian Zhou","doi":"10.1145/2847263.2847264","DOIUrl":"https://doi.org/10.1145/2847263.2847264","url":null,"abstract":"In this paper, we propose an efficient memory partitioning algorithm for parallel data access via data reuse. We found that for most of the applications in image and video processing, a large amount of data can be reused among different iterations in a loop nest. Motivated by this observation, we propose to cache these reusable data by on-chip registers. The on-chip registers used to cache the re-fetched data can be organized as chains of registers. The non-reusable data are then partitioned into several memory banks by a memory partition algorithm. We revise the existing padding method to cover cases occurring frequently in our method that some components of partition vector are zeros. Experimental results have demonstrated that compared with the state-of-the-art algorithms the proposed method can reduce the required number of memory banks by 59.8% on average. The corresponding resources for bank mapping is also significantly reduced. The number of LUTs is reduced by 78.6%. The number of Flip-Flops is reduced by 66.8%. The number of DSP48Es is reduced by 41.7%. Moreover, the storage overheads of the proposed method are zeros for most of the widely used access patterns in image filtering.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132038486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Doubling FPGA Throughput via a Soft SerDes Architecture for Full-Bandwidth Serial Pipelining (Abstract Only)","authors":"Aaron Landy, G. Stitt","doi":"10.1145/2847263.2847301","DOIUrl":"https://doi.org/10.1145/2847263.2847301","url":null,"abstract":"Serial arithmetic has been shown to offer attractive advantages in area, clock frequency, and functional density for FPGA datapaths but suffers from a significant reduction in throughput compared to traditional bit-parallel designs that is prohibitive for many applications. In this work, we present a full-bandwidth SerDes architecture specialized for Xilinx FPGAs that enables serial pipelines to accept inputs and generate outputs at the same rate as bit-parallel pipelines. When combined with the clock improvements from serial pipelines, we show that this approach offers more than 2.1x average increase in throughput compared to bit-parallel pipelines. Although previous work has shown that serial pipelines can achieve similar results for some limited situations, the key contribution of this work is the ability to replace potentially any existing FPGA pipeline with a higher throughput serialized alternative. We also present a serialized sliding-window architecture that improves throughput up to 4x.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125387843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HyperPipelining of High-Speed Interface Logic","authors":"Gregg Baeckler","doi":"10.1145/2847263.2847285","DOIUrl":"https://doi.org/10.1145/2847263.2847285","url":null,"abstract":"The throughput needs of networking designs on FPGAs are constantly growing -- from 40Gbps to 100Gbps, 400Gbps and beyond. A 400G Ethernet MAC needs to process wide data at high speeds to meet the throughput needs. Altera recently introduced HyperFlexTM [1][2][3], a change to the fabric architecture aimed to facilitate massive pipelining of FPGA designs -- allowing them to run faster and hence alleviate the congestion that is caused by widening datapaths beyond 512b or 1024b. Though it seems counterintuitive it can be easier to close timing at 781 MHz for a 640b datapath than at 390 MHz for a 1280b datapath when wire congestion is taken into account. This presentation will discuss some of the practical details in implementing high-throughput protocols such as Ethernet and Interlaken, how we address these traditionally and how the design of the cores is modified with HyperPipelining. We will discuss alternative development styles for control and datapath logic, strategies for wire planning to avoid congestion, the throughput limits of FPGA routing networks, common timing closure issues and how to alleviate them, and how to pipeline intelligently. This presentation is thus partly a tutorial in the issues of making a 400G FPGA design close timing, and partly a case study of using HyperFlex on an FPGA design.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121330591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, Huazhong Yang
{"title":"Going Deeper with Embedded FPGA Platform for Convolutional Neural Network","authors":"Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, Huazhong Yang","doi":"10.1145/2847263.2847265","DOIUrl":"https://doi.org/10.1145/2847263.2847265","url":null,"abstract":"In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are com-putational-intensive and resource-consuming, and thus are hard to be integrated into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNN, but the limited bandwidth and on-chip memory size limit the performance of FPGA accelerator for CNN. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for Image-Net large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that Convolutional layers are computational-centric and Fully-Connected layers are memory-centric. Then the dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNN are proposed to improve the bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure a high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on Xilinx Zynq ZC706 board achieves a frame rate at 4.45 fps with the top-5 accuracy of 86.66% using 16-bit quantization. The average performance of convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s under 150MHz working frequency, which outperform previous approaches significantly.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121712811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 1: Neural Networks and OpenCL","authors":"J. Anderson","doi":"10.1145/3250859","DOIUrl":"https://doi.org/10.1145/3250859","url":null,"abstract":"","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131347422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Circuits for Streamed Linear Permutations Using RAM","authors":"F. Serre, Thomas Holenstein, Markus Püschel","doi":"10.1145/2847263.2847277","DOIUrl":"https://doi.org/10.1145/2847263.2847277","url":null,"abstract":"We propose a method to automatically derive hardware structures that perform a fixed linear permutation on streaming data. Linear permutations are permutations that map linearly the bit representation of the elements addresses. This set contains many of the most important permutations in media processing, communication, and other applications and includes perfect shuffles, stride permutations, and the bit reversal. Streaming means that the data to be permuted arrive as a sequence of chunks over several cycles. We solve this problem by mathematically decomposing a given permutation into a sequence of three permutations that are either temporal or spatial. The former are implemented as banks of RAM, the latter as switching networks. We prove optimality of our solution in terms of the number of switches in these networks.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117061894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 8: Applications","authors":"G. Constantinides","doi":"10.1145/3250867","DOIUrl":"https://doi.org/10.1145/3250867","url":null,"abstract":"","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122037844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards PVT-Tolerant Glitch-Free Operation in FPGAs","authors":"Safeen Huda, J. Anderson","doi":"10.1145/2847263.2847272","DOIUrl":"https://doi.org/10.1145/2847263.2847272","url":null,"abstract":"Glitches are unnecessary transitions on logic signals that needlessly consume dynamic power. Glitches arise from imbalances in the combinational path delays to a signal, which may cause the signal to toggle multiple times in a given clock cycle before settling to its final value. In this paper, we propose a low-cost circuit structure that is able to eliminate a majority of glitches. The structure, which is incorporated into the output buffers of FPGA logic elements, suppresses pulses on buffer outputs whose duration is shorter than a configurable time window (set at the time of FPGA configuration). Glitches are thereby eliminated \"at the source\" ensuring they do not propagate into the high-capacitance FPGA interconnect, saving power. An experimental study, using Altera commercial tools for power analysis, demonstrates that the proposed technique reduces 70% of glitches, at a cost of 1% reduction in speed performance.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125114006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}