Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Articles (Page 2)

HGum: Messaging Framework for Hardware Accelerators (Abstract Only)

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847289

Sizhuo Zhang, Hari Angepat, Derek Chiou

Abstract: Software messaging frameworks help avoid errors and reduce engineering effort in building distributed systems by (i) providing an interface definition language (IDL) to precisely specify the structure of the message (the message schema) and (ii) automatically generating the serialization and deserialization functions that transform user data structures into binary data for sending across the network and vice versa. Similarly, a hardware-accelerated system that consists of host software and multiple FPGAs could also benefit from a messaging framework to handle messages both between software and FPGAs and between different FPGAs. The key challenge for a hardware messaging framework is that it must be able to support large messages with complex schemas while meeting critical constraints such as clock frequency, area, and throughput. We present HGum, a messaging framework for hardware accelerators that meets all the above requirements. HGum is able to generate high-performance and low-cost hardware logic by employing a novel design that algorithmically parses the message schema to perform serialization and deserialization. Our evaluation of HGum shows that it not only significantly reduces engineering effort but also generates hardware of comparable quality to a manual implementation.

Citations: 1
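
The HGum abstract above describes serialization and deserialization driven by a message schema. As a rough software analogy only (HGum itself generates hardware logic, and its IDL is not shown in this listing), the idea can be sketched in Python. The schema layout, field names, and little-endian encoding below are all assumptions made for illustration.

```python
import struct

# Hypothetical message schema: ordered (field name, struct format code) pairs.
# This is an illustrative stand-in, not HGum's actual IDL.
SCHEMA = [("msg_id", "I"), ("length", "H"), ("flag", "B")]


def serialize(schema, message):
    """Pack the message fields into bytes, in schema order (little-endian)."""
    return b"".join(struct.pack("<" + fmt, message[name]) for name, fmt in schema)


def deserialize(schema, data):
    """Walk the schema field by field to rebuild the message from bytes."""
    message, offset = {}, 0
    for name, fmt in schema:
        (message[name],) = struct.unpack_from("<" + fmt, data, offset)
        offset += struct.calcsize("<" + fmt)
    return message


if __name__ == "__main__":
    msg = {"msg_id": 7, "length": 300, "flag": 1}
    wire = serialize(SCHEMA, msg)
    print(len(wire), deserialize(SCHEMA, wire) == msg)
```

Because both functions walk the same schema, adding a field means changing only the schema, which is the engineering saving that a messaging framework aims for.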

Efficient Memory Partitioning for Parallel Data Access via Data Reuse

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847264

Jincheng Su, Fan Yang, Xuan Zeng, Dian Zhou

Abstract: In this paper, we propose an efficient memory partitioning algorithm for parallel data access via data reuse. We found that for most applications in image and video processing, a large amount of data can be reused among different iterations in a loop nest. Motivated by this observation, we propose to cache these reusable data in on-chip registers. The on-chip registers used to cache the re-fetched data can be organized as chains of registers. The non-reusable data are then partitioned into several memory banks by a memory partition algorithm. We revise the existing padding method to cover cases, frequent in our method, in which some components of the partition vector are zero. Experimental results demonstrate that, compared with state-of-the-art algorithms, the proposed method reduces the required number of memory banks by 59.8% on average. The resources required for bank mapping are also significantly reduced: the number of LUTs by 78.6%, the number of flip-flops by 66.8%, and the number of DSP48Es by 41.7%. Moreover, the storage overhead of the proposed method is zero for most of the widely used access patterns in image filtering.

Citations: 12
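
To make the link between data reuse and bank count in the abstract above concrete, here is a minimal sketch. It assumes a simplified one-dimensional cyclic partitioning (bank = address mod N), not the paper's partition-vector and padding algorithm. The function name `min_banks` and the 3-tap window are hypothetical.

```python
def min_banks(offsets, max_banks=64):
    """Return the smallest N for which all offsets map to distinct banks.

    Assumes cyclic partitioning, bank = address % N. For accesses x[i + d],
    the banks (i + d) % N are distinct exactly when the values d % N are
    distinct, so the loop index i does not need to be enumerated.
    """
    for n in range(1, max_banks + 1):
        if len({d % n for d in offsets}) == len(offsets):
            return n
    raise ValueError("no conflict-free partition within max_banks")


if __name__ == "__main__":
    # A 3-tap window reads x[i], x[i+1], x[i+2] in every iteration.
    print(min_banks([0, 1, 2]))
    # With a register chain holding x[i] and x[i+1] from earlier iterations,
    # only x[i+2] still has to come from memory.
    print(min_banks([2]))
```

In this toy setting, caching the reused elements in registers cuts the number of simultaneous memory reads, and so the number of banks needed, from three to one.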

Doubling FPGA Throughput via a Soft SerDes Architecture for Full-Bandwidth Serial Pipelining (Abstract Only)

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847301

Aaron Landy, G. Stitt

Citations: 0

HyperPipelining of High-Speed Interface Logic

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847285

Gregg Baeckler

Abstract: The throughput needs of networking designs on FPGAs are constantly growing, from 40 Gbps to 100 Gbps, 400 Gbps, and beyond. A 400G Ethernet MAC needs to process wide data at high speeds to meet these needs. Altera recently introduced HyperFlex™ [1][2][3], a change to the fabric architecture aimed at facilitating massive pipelining of FPGA designs, allowing them to run faster and hence alleviating the congestion caused by widening datapaths beyond 512b or 1024b. Though it seems counterintuitive, it can be easier to close timing at 781 MHz for a 640b datapath than at 390 MHz for a 1280b datapath when wire congestion is taken into account. This presentation will discuss some of the practical details of implementing high-throughput protocols such as Ethernet and Interlaken, how we address these traditionally, and how the design of the cores is modified with HyperPipelining. We will discuss alternative development styles for control and datapath logic, strategies for wire planning to avoid congestion, the throughput limits of FPGA routing networks, common timing closure issues and how to alleviate them, and how to pipeline intelligently. This presentation is thus partly a tutorial on the issues involved in making a 400G FPGA design close timing, and partly a case study of using HyperFlex on an FPGA design.

Citations: 2
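
The two datapath configurations compared in the abstract above (640b at 781 MHz versus 1280b at 390 MHz) carry nearly the same raw bandwidth, which a quick calculation confirms. The sketch assumes one full-width word is transferred per clock cycle and ignores protocol overhead. The helper name `raw_throughput_gbps` is hypothetical.

```python
def raw_throughput_gbps(width_bits, clock_mhz):
    """Raw datapath throughput in Gbps, assuming one word per clock cycle."""
    return width_bits * clock_mhz / 1000.0


if __name__ == "__main__":
    narrow_fast = raw_throughput_gbps(640, 781)   # about 499.84 Gbps
    wide_slow = raw_throughput_gbps(1280, 390)    # about 499.2 Gbps
    print(f"{narrow_fast:.2f} Gbps vs {wide_slow:.2f} Gbps")
```

Both options leave similar headroom above 400 Gbps, so under these assumptions the choice between them comes down to timing closure and wire congestion rather than bandwidth.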

Going Deeper with Embedded FPGA Platform for Convolutional Neural Network

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847265

Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, Huazhong Yang

Abstract: In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computation-intensive and resource-consuming, and thus are hard to integrate into embedded systems such as smartphones, smart glasses, and robots. The FPGA is one of the most promising platforms for accelerating CNNs, but limited bandwidth and on-chip memory size constrain the performance of FPGA accelerators for CNNs. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that Convolutional layers are computation-centric and Fully-Connected layers are memory-centric. Then a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNNs are proposed to improve bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on the Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the convolutional layers and of the full CNN is 187.8 GOP/s and 137.0 GOP/s, respectively, at a 150 MHz working frequency, which significantly outperforms previous approaches.

Citations: 1051
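
The abstract above mentions a dynamic-precision data quantization method without giving details. One common reading, taken here as an assumption rather than the authors' exact procedure, is fixed-point quantization in which the number of fractional bits is chosen separately for each group of values to minimize error. The function names, the candidate range, and the absolute-error criterion in this sketch are hypothetical.

```python
def quantize(values, bits, frac_len):
    """Round to signed fixed point with `bits` total bits, `frac_len` of them
    fractional, saturating at the representable range."""
    scale = 2.0 ** frac_len
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [min(max(round(v * scale), lo), hi) / scale for v in values]


def best_frac_len(values, bits, candidates=range(-8, 17)):
    """Pick the fractional length with the smallest total absolute error."""
    def total_error(frac_len):
        quantized = quantize(values, bits, frac_len)
        return sum(abs(v - q) for v, q in zip(values, quantized))
    return min(candidates, key=total_error)


if __name__ == "__main__":
    small_weights = [0.01, -0.02, 0.03]   # small magnitudes want more fractional bits
    large_activations = [10.5, -20.25, 30.0]  # large magnitudes want fewer
    print(best_frac_len(small_weights, 8), best_frac_len(large_activations, 8))
```

Letting each group of values pick its own fractional length is what allows a short word such as 8 bits to cover both small weights and large activations in this sketch.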

Session details: Technical Session 3: Circuit Design, Graph Processing Applications

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/3250861

M. Hutton

Citations: 0

Session details: Technical Session 1: Neural Networks and OpenCL

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/3250859

J. Anderson

Citations: 0

Optimal Circuits for Streamed Linear Permutations Using RAM

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847277

F. Serre, Thomas Holenstein, Markus Püschel

Citations: 15

Session details: Technical Session 8: Applications

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/3250867

G. Constantinides

Citations: 0

Towards PVT-Tolerant Glitch-Free Operation in FPGAs

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847272

Safeen Huda, J. Anderson

Citations: 7