2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Articles

FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates
Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, J. Cong
DOI: 10.1109/FCCM.2017.25
Abstract: DNNs (Deep Neural Networks) have demonstrated great success in numerous applications such as image classification, speech recognition, and video analysis. However, DNNs are much more computation-intensive and memory-intensive than previous shallow models, so deploying them in both large-scale data centers and real-time embedded systems is challenging. Considering performance, flexibility, and energy efficiency, FPGA-based accelerators for DNNs are a promising solution. Unfortunately, conventional accelerator design flows make it difficult for FPGA developers to keep up with the fast pace of innovation in DNNs. To overcome this problem, we propose FP-DNN (Field Programmable DNN), an end-to-end framework that takes TensorFlow-described DNNs as input and automatically generates hardware implementations on FPGA boards using RTL-HLS hybrid templates. FP-DNN performs DNN model inference with our high-performance computation engine and carefully designed communication optimization strategies. We implement CNNs, LSTM-RNNs, and Residual Nets with FP-DNN, and experimental results show the performance and flexibility provided by the proposed FP-DNN framework.
Citations: 251
A Configurable FPGA Implementation of the Tanh Function Using DCT Interpolation
A. Abdelsalam, J. Langlois, F. Cheriet
DOI: 10.1109/FCCM.2017.12
Abstract: Efficient implementation of non-linear activation functions is essential to deploying deep learning models on FPGAs. We introduce such an implementation based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed interpolation architecture combines simple arithmetic operations on stored samples of the hyperbolic tangent function and on input data. It achieves almost 3× better precision than previous works while using a similar amount of computational resources and a small amount of memory. Various combinations of DCTIF parameters can be chosen to trade off the accuracy and the overall circuit complexity of the tanh function. In one configuration, the proposed architecture approximates the hyperbolic tangent activation function with a maximum error of 0.004 while requiring only 1.45 kbits of BRAM and 21 LUTs of a Virtex-7 FPGA.
Citations: 13
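The tradeoff this abstract describes, spending stored samples and a little interpolation arithmetic to approximate tanh, can be illustrated with a plain lookup table plus linear interpolation. This is a simpler scheme than the paper's DCT-based filter, and the table size and clamp range below are illustrative assumptions:

```python
import math

TABLE_SIZE = 64              # number of stored intervals (assumption)
X_MAX = 4.0                  # tanh saturates near +/-1 beyond |x| = 4
STEP = X_MAX / TABLE_SIZE
TABLE = [math.tanh(i * STEP) for i in range(TABLE_SIZE + 1)]

def tanh_lut(x: float) -> float:
    """Approximate tanh(x) from stored samples plus linear interpolation."""
    sign = -1.0 if x < 0.0 else 1.0   # tanh is odd, so store x >= 0 only
    x = abs(x)
    if x >= X_MAX:
        return sign                    # saturation region
    idx = int(x / STEP)
    frac = x / STEP - idx
    return sign * (TABLE[idx] + frac * (TABLE[idx + 1] - TABLE[idx]))

# Sweep the input range to measure the worst-case approximation error;
# enlarging TABLE_SIZE (i.e. spending more memory) shrinks this error.
max_err = max(abs(tanh_lut(i / 100.0) - math.tanh(i / 100.0))
              for i in range(-600, 601))
```

Even this naive scheme lands below 10^-3 maximum error with 64 intervals; the point of the DCTIF parameters is to let a designer tune exactly this memory/accuracy/logic balance.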
Exploring High Efficiency Hardware Accelerator for the Key Algorithm of Square Kilometer Array Telescope Data Processing
Qian Wu, Yongxin Zhu, Xu Wang, Mengjun Li, Junjie Hou, A. Masoumi
DOI: 10.1109/FCCM.2017.32
Abstract: The SKA (Square Kilometer Array) radio telescope, currently under construction, will become the largest telescope in the world, integrating sampled data from a huge number of small antenna nodes in the array to emulate a giant antenna. Due to limited storage space, the SKA must process massive data in real time, which makes SKA scientific data processing a computational bottleneck. However, existing off-the-shelf high-performance computing solutions cannot meet the computation requirements (5 times more than the top-1 supercomputer) within the power budget (1/3 of the power of the top-1 supercomputer). In this paper, we explore a high-efficiency FPGA-based design for the most representative key algorithm in SKA data processing, Gridding, which is the most time- and memory-consuming step. We propose an efficient hardware accelerator for the Gridding algorithm on FPGA, which, to our knowledge, is the first FPGA-based design of the Gridding algorithm in this community. In our design, we unroll the third loop of the Gridding algorithm and design corresponding hardware pipeline stages to achieve high-efficiency hardware acceleration. The functionality and performance of our design are verified both in simulation and on an FPGA prototyping board; the results show that our proposed hardware implementation achieves a large performance improvement over a software implementation running on generic CPUs. We believe our design is a strong candidate for removing the bottleneck in SKA data processing.
Citations: 4
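The Gridding kernel described above convolves each irregularly sampled measurement onto a regular grid. A minimal software sketch of those nested loops, the innermost of which is what a hardware pipeline would unroll, looks like the following. The Gaussian kernel and the grid/support sizes are illustrative assumptions, not the SKA pipeline's actual convolution function:

```python
import math

GRID = 64          # grid dimension (assumption)
SUPPORT = 3        # half-width of the convolution kernel in cells

def kernel(du: float, dv: float) -> float:
    # Simple Gaussian stand-in for the real gridding kernel.
    return math.exp(-(du * du + dv * dv))

def grid_samples(samples):
    """samples: list of (u, v, value) with irregular 0 <= u, v < GRID."""
    grid = [[0.0] * GRID for _ in range(GRID)]
    for u, v, val in samples:
        cu, cv = int(round(u)), int(round(v))
        # Each sample touches a (2*SUPPORT+1)^2 neighbourhood of cells;
        # these inner loops dominate both time and memory traffic.
        for dv in range(-SUPPORT, SUPPORT + 1):
            for du in range(-SUPPORT, SUPPORT + 1):
                gu, gv = cu + du, cv + dv
                if 0 <= gu < GRID and 0 <= gv < GRID:
                    grid[gv][gu] += val * kernel(gu - u, gv - v)
    return grid
```

The irregular read-modify-write pattern on `grid` is what makes the algorithm memory-bound and a natural target for custom pipelining.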
Fine-Grained Acceleration of Binary Neural Networks Using Intel® Xeon® Processor with Integrated FPGA
Philip Colangelo, Randy Huang, Enno Lübbers, M. Margala, Kevin Nealis
DOI: 10.1109/FCCM.2017.46
Abstract: Binary weighted networks (BWN) for image classification reduce computation for convolutional neural networks (CNN) from multiply-adds to accumulates with little to no accuracy loss. Hardware architectures such as FPGAs can take full advantage of BWN computations because of their ability to express weights represented as 0 and 1 efficiently through customizable logic. In this paper, we present an implementation on Intel®'s Xeon® processor with integrated FPGA to accelerate binary weighted networks. We interface Intel's Accelerator Abstraction Layer (AAL) with Caffe to provide a robust framework for accelerating CNNs. Utilizing the low-latency Quick Path Interconnect (QPI) between the Broadwell Xeon® processor and the Arria 10 FPGA, we can perform fine-grained offloads of specific portions of the network. Because convolution layers account for most of the computation in our experiments, we offload the feature and weight data to customized binary hardware in the FPGA for faster execution. An initial proof-of-concept design shows that by using the Xeon processor and FPGA together we can improve throughput by 2x on some layers and by 1.3x overall, while utilizing only a small percentage of the FPGA core logic.
Citations: 6
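The arithmetic simplification this paper builds on, that with weights constrained to ±1 a multiply-accumulate degenerates to accumulate-only, can be sketched in a few lines. The 0/1 bit encoding for -1/+1 is an assumption about representation, not taken from the paper:

```python
# Binary-weight dot product: a weight bit of 1 means +1, a bit of 0
# means -1, so every multiply-add becomes a plain add or subtract.
def binary_dot(activations, weight_bits):
    acc = 0.0
    for a, w in zip(activations, weight_bits):
        acc += a if w else -a   # no multiplier needed
    return acc

# Reference dot product with explicit +/-1 weights, for comparison.
def reference_dot(activations, weights):
    return sum(a * w for a, w in zip(activations, weights))
```

On an FPGA the add/subtract select is a single bit of customizable logic per weight, which is why BWNs map so efficiently to the fabric.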
SWiF: A Simplified Workload-Centric Framework for FPGA-Based Computing
David Ojika, P. Majcher, Wojciech Neubauer, S. Subhaschandra, D. Acosta
DOI: 10.1109/FCCM.2017.52
Abstract: In this paper, we introduce SWiF (Simplified Workload-intuitive Framework), a workload-centric application programming framework designed to simplify the large-scale deployment of FPGAs in end-to-end applications. SWiF intelligently mediates access to shared resources by orchestrating the distribution and scheduling of tasks across a heterogeneous mix of FPGA and CPU resources in order to improve utilization and maintain system requirements. We implemented SWiF atop Intel's Accelerator Abstraction Layer (AAL) and deployed the resulting software stack in a datacenter on an Intel Xeon+FPGA server running Apache Spark. We demonstrate that with SWiF's API, developers can flexibly and easily deploy FPGA-enabled applications and frameworks with almost no change to the existing software stack. In particular, by offloading Spark's compression workload onto the FPGA through SWiF, we gain a speedup of 3.2X in total job execution, and up to 5X when Spark's Resilient Distributed Datasets (RDDs) are persisted in memory.
Citations: 0
TLegUp: A TMR Code Generation Tool for SRAM-Based FPGA Applications Using HLS
Ganghee Lee, D. Agiakatsikas, Tong Wu, E. Çetin, O. Diessel
DOI: 10.1109/FCCM.2017.57
Abstract: We present TLegUp, an extension of LegUp that automatically generates Triple Modular Redundant (TMR) designs for FPGAs from C programs. TLegUp is expected to improve the productivity of application designers for space, to allow designers to experiment with alternative application partitioning, voter insertion, and fault-tolerance-aware scheduling and binding algorithms, and to support the automatic insertion of the infrastructure needed to run a fault-tolerant system. In this paper, we examine TLegUp's capacity to make use of both combinational and sequential voters by triplicating a design before scheduling and binding occur. In contrast, traditional RTL-based tools are constrained to use only combinational voters so as to preserve the scheduling and binding of the design; critical path lengths are consequently increased. We compare the use of sequential and combinational voters for a range of benchmarks implemented on a Xilinx Virtex-6 FPGA in terms of: (i) maximum operating frequency, (ii) latency, (iii) execution time, and (iv) soft-error sensitivity. Compared to the use of combinational voters, the use of sequential voters reduces application execution time on the CHStone benchmark suite by 4% on average.
Citations: 19
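The 2-of-3 voting that TMR relies on can be sketched briefly. The bitwise form below mirrors how a combinational voter is built per bit in hardware; this is a behavioural illustration, not TLegUp's generated RTL:

```python
# Per-bit 2-of-3 majority: a single faulty replica is masked because any
# flipped bit is outvoted by the two agreeing copies.
def majority_vote(a: int, b: int, c: int) -> int:
    return (a & b) | (b & c) | (a & c)

# Run three replicas of a module and vote on their outputs; with one
# fault-free system all three agree, so the vote is transparent.
def tmr_run(module, x):
    return majority_vote(module(x), module(x), module(x))
```

A sequential voter adds a register after this logic, which is what lets TLegUp trade a small latency increase for a shorter critical path.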
Minimalist Design for Accelerating Convolutional Neural Networks for Low-End FPGA Platforms
Raghid Morcel, Haitham Akkary, Hazem M. Hajj, M. Saghir, A. Keshavamurthy, R. Khanna, H. Artail
DOI: 10.1109/FCCM.2017.62
Abstract: Deep neural networks have gained tremendous attention in both the academic and industrial communities due to their performance in many artificial intelligence applications, particularly in computer vision. However, these algorithms are known to be computationally very demanding for both scoring and model learning. State-of-the-art recognition models use tens of millions of parameters and have significant memory and computational requirements. These requirements have restricted deep neural network applications to high-end, expensive, and power-hungry IoT platforms. This paper presents work at the intersection of several evolving technologies: emerging IoT platforms, deep learning, and Field-Programmable Gate Array (FPGA) computing. We demonstrate a new minimalist design methodology that minimizes the utilization of FPGA resources and can run deep learning algorithms with over 60 million parameters. This makes it particularly suitable for resource-constrained, low-end FPGA platforms.
Citations: 11
Implementing FPGA Overlay NoCs Using the Xilinx UltraScale Memory Cascades
Nachiket Kapre
DOI: 10.1109/FCCM.2017.15
Abstract: We can enhance the performance and efficiency of deflection-routed FPGA overlay NoCs by exploiting the cascading feature of the Xilinx UltraScale BlockRAMs. This allows us to (1) harden the multiplexers in the NoC switch crossbars, and (2) efficiently add buffering support to deflection routing. While buffering is not required for correct operation of a deflection-routed NoC, it can boost network throughput for large system sizes under heavy load and enable functional support for fixed-length, multi-flit NoC traffic. Since the multiplexer controls of the cascaded RAMs can be driven from user logic, the NoC routing function can be implemented in LUTs while the data is steered across the dedicated cascade multiplexers and links. Thus, our approach uses hard resources in the BlockRAM architecture to absorb the bulk of the cost of a NoC, in the form of both crossbar multiplexing and packet queuing. For the XCVU9P UltraScale+ FPGA, we show how to map the 72b Hoplite NoC router at a cost of 3 FIFO blocks, 64 LUTs, and 40 FFs per switch while operating at ≈727 MHz (400 MHz in a 60×12 grid). This reduces LUT count by 1.4× and FF cost by 2× over a pure LUT-based implementation while also being 1.2× faster. For uniform RANDOM traffic, we boost the throughput of a 16×16 NoC by 50–60%, reduce worst-case packet latency by ≈40%, and lower energy use by 10–40% over classic bufferless deflection routing at injection rates of 15–20% and higher with 16-deep buffers. When compared to hard NoC router designs, our BRAM-based soft NoC also closes the area gap to under a factor of two, instead of the 20–23× gap claimed in earlier studies.
Citations: 8
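Bufferless deflection routing of the kind Hoplite uses can be sketched as follows. This is a heavily simplified behavioural model (two input ports, fixed W-over-N priority, torus wrap links not modeled), assumed for illustration rather than taken from the paper's switch design:

```python
def desired_port(pkt, x, y):
    """Dimension-ordered preference at switch (x, y): X ring, then Y ring."""
    dx, dy = pkt                 # packet = destination coordinates
    if dx != x:
        return "E"               # keep moving along the x ring
    if dy != y:
        return "S"               # then along the y ring
    return "EJECT"               # arrived at the destination switch

def route_two(pkt_w, pkt_n, x, y):
    """One switch cycle: the W-input packet wins; on a conflict the
    N-input packet is deflected to the other output rather than stalled,
    so no buffering is ever required for correctness."""
    w_want = desired_port(pkt_w, x, y)
    n_want = desired_port(pkt_n, x, y)
    out = {w_want: pkt_w}
    if n_want in out:
        n_want = "S" if n_want == "E" else "E"   # deflect, never stall
    out[n_want] = pkt_n
    return out
```

The paper's contribution is to fold the output multiplexing (and optional queues) of exactly this kind of switch into the BlockRAM cascade hardware instead of LUTs.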
Automata-to-Routing: An Open-Source Toolchain for Design-Space Exploration of Spatial Automata Processing Architectures
J. Wadden, S. Khan, K. Skadron
DOI: 10.1109/FCCM.2017.38
Abstract: Newly available spatial architectures that accelerate finite-automata processing have spurred research and development on novel automata-based applications. However, research on spatial automata processing architectures is lacking, owing to a lack of automata optimization and place-and-route tools. To address this, we propose a new open-source toolchain, Automata-to-Routing (ATR), that enables design-space exploration of spatial automata architectures. ATR leverages existing open-source tools for both automata processing and FPGA architecture research. To demonstrate the usefulness of this new toolchain, we use it to analyze design choices of spatial automata processing architectures. We first show that ATR can model the logic tiles of a commercially available spatial automata processing architecture. We then use ATR to compare and contrast two different routing architecture methodologies, hierarchical and 2D-mesh, over a set of diverse automata benchmarks. We show that shallower 2D-mesh-style routing fabrics can route complex automata with equal channel width while using up to 4.2x fewer logic tile resources.
Citations: 11
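The execution model that spatial automata architectures implement is the homogeneous automaton: every state owns a character class, and on each input symbol the whole active set advances in lockstep. A minimal sketch (the two-state "a+b" pattern and the always-active start semantics are illustrative assumptions):

```python
# states: id -> (matching character class, successor ids, reports on match)
STATES = {
    0: (set("a"), {0, 1}, False),   # one or more 'a'
    1: (set("b"), set(), True),     # then a single 'b' -> report
}
START = {0}

def run(symbols):
    """Return the input offsets at which a reporting state matched."""
    active, reports = set(START), []
    for i, ch in enumerate(symbols):
        nxt = set()
        for s in active:
            cls, succ, report = STATES[s]
            if ch in cls:           # in hardware, all states match in parallel
                nxt |= succ
                if report:
                    reports.append(i)
        active = nxt | START        # start states re-arm every cycle
    return reports
```

Place-and-route for such architectures (ATR's job) is essentially mapping the `succ` edges onto a physical routing fabric with bounded channel width.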
Efficient GPGPU Computing with Cross-Core Resource Sharing and Core Reconfiguration
Ashutosh Dhar, Deming Chen
DOI: 10.1109/FCCM.2017.59
Abstract: GPUs are capable of running a variety of applications, but their generic parallel architecture can lead to inefficient use of resources and reduced power efficiency due to algorithmic or architectural constraints. In this work, taking inspiration from CGRAs (coarse-grained reconfigurable architectures), we demonstrate resource sharing and redistribution as a solution that can be leveraged by reconfiguring the GPU on a kernel-by-kernel basis. We explore four different schemes that trade the number of active SMs (streaming multiprocessors) for increased occupancy and local memory resources per SM, and demonstrate improved power and energy with limited impact on performance. Our most aggressive scheme, BigSM, saves up to 54% energy, and 26% on average.
Citations: 2