2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Articles

FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates
Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, J. Cong
DOI: 10.1109/FCCM.2017.25
Abstract: DNNs (Deep Neural Networks) have demonstrated great success in numerous applications such as image classification, speech recognition, and video analysis. However, DNNs are much more computation-intensive and memory-intensive than previous shallow models, so deploying them in both large-scale data centers and real-time embedded systems is challenging. Considering performance, flexibility, and energy efficiency, FPGA-based accelerators for DNNs are a promising solution. Unfortunately, conventional accelerator design flows make it difficult for FPGA developers to keep up with the fast pace of innovation in DNNs. To overcome this problem, we propose FP-DNN (Field Programmable DNN), an end-to-end framework that takes TensorFlow-described DNNs as input and automatically generates hardware implementations on FPGA boards using RTL-HLS hybrid templates. FP-DNN performs DNN model inference with our high-performance computation engine and carefully designed communication optimization strategies. We implement CNNs, LSTM-RNNs, and Residual Nets with FP-DNN, and experimental results show the performance and flexibility provided by the proposed FP-DNN framework.
Citations: 251
A Configurable FPGA Implementation of the Tanh Function Using DCT Interpolation
A. Abdelsalam, J. Langlois, F. Cheriet
DOI: 10.1109/FCCM.2017.12
Abstract: Efficient implementation of non-linear activation functions is essential to deploying deep learning models on FPGAs. We introduce such an implementation based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed interpolation architecture combines simple arithmetic operations on stored samples of the hyperbolic tangent function and on input data. It achieves almost 3× better precision than previous works while using a similar amount of computational resources and a small amount of memory. Various combinations of DCTIF parameters can be chosen to trade off the accuracy and the overall circuit complexity of the tanh function. In one configuration, the proposed architecture approximates the hyperbolic tangent activation function with a maximum error of 0.004 while requiring only 1.45 kbits of BRAM and 21 LUTs of a Virtex-7 FPGA.
Citations: 13
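The tradeoff this abstract describes, spending stored samples and a little interpolation arithmetic to approximate tanh, can be illustrated with a plain lookup table plus linear interpolation. This is a simpler scheme than the paper's DCT-based filter, and the table size and clamp range below are illustrative assumptions:

```python
import math

TABLE_SIZE = 64              # number of stored intervals (assumption)
X_MAX = 4.0                  # tanh saturates near +/-1 beyond |x| = 4
STEP = X_MAX / TABLE_SIZE
TABLE = [math.tanh(i * STEP) for i in range(TABLE_SIZE + 1)]

def tanh_lut(x: float) -> float:
    """Approximate tanh(x) from stored samples plus linear interpolation."""
    sign = -1.0 if x < 0.0 else 1.0   # tanh is odd, so store x >= 0 only
    x = abs(x)
    if x >= X_MAX:
        return sign                    # saturation region
    idx = int(x / STEP)
    frac = x / STEP - idx
    return sign * (TABLE[idx] + frac * (TABLE[idx + 1] - TABLE[idx]))

# Sweep the input range to measure the worst-case approximation error;
# enlarging TABLE_SIZE (i.e. spending more memory) shrinks this error.
max_err = max(abs(tanh_lut(i / 100.0) - math.tanh(i / 100.0))
              for i in range(-600, 601))
```

Even this naive scheme lands below 10^-3 maximum error with 64 intervals; the point of the DCTIF parameters is to let a designer tune exactly this memory/accuracy/logic balance.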
Exploring High Efficiency Hardware Accelerator for the Key Algorithm of Square Kilometer Array Telescope Data Processing
Qian Wu, Yongxin Zhu, Xu Wang, Mengjun Li, Junjie Hou, A. Masoumi
DOI: 10.1109/FCCM.2017.32
Abstract: The SKA (Square Kilometer Array) radio telescope, currently under construction, will become the largest telescope in the world, integrating sampled data from a huge number of small antenna nodes in the array to emulate a giant antenna. Due to limited storage space, the SKA must process massive data in real time, which makes SKA scientific data processing a computational bottleneck. However, existing off-the-shelf high-performance computing solutions cannot meet the computation requirements (5 times more than the top-1 supercomputer) within the power budget (1/3 of the power of the top-1 supercomputer). In this paper, we explore a high-efficiency FPGA-based design for the most representative key algorithm in SKA data processing, Gridding, which is the most time- and memory-consuming step. We propose an efficient hardware accelerator for the Gridding algorithm on FPGA, which, to our knowledge, is the first FPGA-based design of the Gridding algorithm in this community. In our design, we unroll the third loop of the Gridding algorithm and design corresponding hardware pipeline stages to achieve high-efficiency hardware acceleration. The functionality and performance of our design are verified both in simulation and on an FPGA prototyping board; the results show that our proposed hardware implementation achieves a large performance improvement over a software implementation running on generic CPUs. We believe our design is a strong candidate for removing the bottleneck in SKA data processing.
Citations: 4
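The Gridding kernel described above convolves each irregularly sampled measurement onto a regular grid. A minimal software sketch of those nested loops, the innermost of which is what a hardware pipeline would unroll, looks like the following. The Gaussian kernel and the grid/support sizes are illustrative assumptions, not the SKA pipeline's actual convolution function:

```python
import math

GRID = 64          # grid dimension (assumption)
SUPPORT = 3        # half-width of the convolution kernel in cells

def kernel(du: float, dv: float) -> float:
    # Simple Gaussian stand-in for the real gridding kernel.
    return math.exp(-(du * du + dv * dv))

def grid_samples(samples):
    """samples: list of (u, v, value) with irregular 0 <= u, v < GRID."""
    grid = [[0.0] * GRID for _ in range(GRID)]
    for u, v, val in samples:
        cu, cv = int(round(u)), int(round(v))
        # Each sample touches a (2*SUPPORT+1)^2 neighbourhood of cells;
        # these inner loops dominate both time and memory traffic.
        for dv in range(-SUPPORT, SUPPORT + 1):
            for du in range(-SUPPORT, SUPPORT + 1):
                gu, gv = cu + du, cv + dv
                if 0 <= gu < GRID and 0 <= gv < GRID:
                    grid[gv][gu] += val * kernel(gu - u, gv - v)
    return grid
```

The irregular read-modify-write pattern on `grid` is what makes the algorithm memory-bound and a natural target for custom pipelining.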
Fine-Grained Acceleration of Binary Neural Networks Using Intel® Xeon® Processor with Integrated FPGA
Philip Colangelo, Randy Huang, Enno Lübbers, M. Margala, Kevin Nealis
DOI: 10.1109/FCCM.2017.46
Abstract: Binary weighted networks (BWN) for image classification reduce computation for convolutional neural networks (CNN) from multiply-adds to accumulates with little to no accuracy loss. Hardware architectures such as FPGAs can take full advantage of BWN computations because of their ability to express weights represented as 0 and 1 efficiently through customizable logic. In this paper, we present an implementation on Intel®'s Xeon® processor with integrated FPGA to accelerate binary weighted networks. We interface Intel's Accelerator Abstraction Layer (AAL) with Caffe to provide a robust framework for accelerating CNNs. Utilizing the low-latency Quick Path Interconnect (QPI) between the Broadwell Xeon® processor and the Arria 10 FPGA, we can perform fine-grained offloads of specific portions of the network. Because convolution layers account for most of the computation in our experiments, we offload the feature and weight data to customized binary hardware in the FPGA for faster execution. An initial proof-of-concept design shows that by using the Xeon processor and FPGA together we can improve throughput by 2x on some layers and by 1.3x overall, while utilizing only a small percentage of the FPGA core logic.
Citations: 6
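The arithmetic simplification this paper builds on, that with weights constrained to ±1 a multiply-accumulate degenerates to accumulate-only, can be sketched in a few lines. The 0/1 bit encoding for -1/+1 is an assumption about representation, not taken from the paper:

```python
# Binary-weight dot product: a weight bit of 1 means +1, a bit of 0
# means -1, so every multiply-add becomes a plain add or subtract.
def binary_dot(activations, weight_bits):
    acc = 0.0
    for a, w in zip(activations, weight_bits):
        acc += a if w else -a   # no multiplier needed
    return acc

# Reference dot product with explicit +/-1 weights, for comparison.
def reference_dot(activations, weights):
    return sum(a * w for a, w in zip(activations, weights))
```

On an FPGA the add/subtract select is a single bit of customizable logic per weight, which is why BWNs map so efficiently to the fabric.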
SWiF: A Simplified Workload-Centric Framework for FPGA-Based Computing
David Ojika, P. Majcher, Wojciech Neubauer, S. Subhaschandra, D. Acosta
DOI: 10.1109/FCCM.2017.52
Abstract: In this paper, we introduce SWiF (Simplified Workload-intuitive Framework), a workload-centric application programming framework designed to simplify the large-scale deployment of FPGAs in end-to-end applications. SWiF intelligently mediates access to shared resources by orchestrating the distribution and scheduling of tasks across a heterogeneous mix of FPGA and CPU resources in order to improve utilization and maintain system requirements. We implemented SWiF atop Intel's Accelerator Abstraction Layer (AAL) and deployed the resulting software stack in a datacenter on an Intel Xeon+FPGA server running Apache Spark. We demonstrate that with SWiF's API, developers can flexibly and easily deploy FPGA-enabled applications and frameworks with almost no change to the existing software stack. In particular, by offloading Spark's compression workload onto the FPGA through SWiF, we gain a speedup of 3.2X in total job execution, and up to 5X when Spark's Resilient Distributed Datasets (RDDs) are persisted in memory.
Citations: 0
TLegUp: A TMR Code Generation Tool for SRAM-Based FPGA Applications Using HLS
Ganghee Lee, D. Agiakatsikas, Tong Wu, E. Çetin, O. Diessel
DOI: 10.1109/FCCM.2017.57
Abstract: We present TLegUp, an extension of LegUp that automatically generates Triple Modular Redundant (TMR) designs for FPGAs from C programs. TLegUp is expected to improve the productivity of application designers for space, to allow designers to experiment with alternative application partitioning, voter insertion, and fault-tolerance-aware scheduling and binding algorithms, and to support the automatic insertion of the infrastructure needed to run a fault-tolerant system. In this paper, we examine TLegUp's capacity to make use of both combinational and sequential voters by triplicating a design before scheduling and binding occur. In contrast, traditional RTL-based tools are constrained to use only combinational voters so as to preserve the scheduling and binding of the design; critical path lengths are consequently increased. We compare the use of sequential and combinational voters for a range of benchmarks implemented on a Xilinx Virtex-6 FPGA in terms of: (i) maximum operating frequency, (ii) latency, (iii) execution time, and (iv) soft-error sensitivity. Compared to the use of combinational voters, the use of sequential voters reduces application execution time on the CHStone benchmark suite by 4% on average.
Citations: 19
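The 2-of-3 voting that TMR relies on can be sketched briefly. The bitwise form below mirrors how a combinational voter is built per bit in hardware; this is a behavioural illustration, not TLegUp's generated RTL:

```python
# Per-bit 2-of-3 majority: a single faulty replica is masked because any
# flipped bit is outvoted by the two agreeing copies.
def majority_vote(a: int, b: int, c: int) -> int:
    return (a & b) | (b & c) | (a & c)

# Run three replicas of a module and vote on their outputs; with one
# fault-free system all three agree, so the vote is transparent.
def tmr_run(module, x):
    return majority_vote(module(x), module(x), module(x))
```

A sequential voter adds a register after this logic, which is what lets TLegUp trade a small latency increase for a shorter critical path.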
Minimalist Design for Accelerating Convolutional Neural Networks for Low-End FPGA Platforms
Raghid Morcel, Haitham Akkary, Hazem M. Hajj, M. Saghir, A. Keshavamurthy, R. Khanna, H. Artail
DOI: 10.1109/FCCM.2017.62
Abstract: Deep neural networks have gained tremendous attention in both the academic and industrial communities due to their performance in many artificial intelligence applications, particularly in computer vision. However, these algorithms are known to be computationally very demanding for both scoring and model learning. State-of-the-art recognition models use tens of millions of parameters and have significant memory and computational requirements. These requirements have restricted deep neural network applications to high-end, expensive, and power-hungry IoT platforms. This paper presents work at the intersection of several evolving technologies: emerging IoT platforms, deep learning, and Field-Programmable Gate Array (FPGA) computing. We demonstrate a new minimalist design methodology that minimizes the utilization of FPGA resources and can run deep learning algorithms with over 60 million parameters. This makes it particularly suitable for resource-constrained, low-end FPGA platforms.
Citations: 11
Implementing FPGA Overlay NoCs Using the Xilinx UltraScale Memory Cascades
Nachiket Kapre
DOI: 10.1109/FCCM.2017.15
Abstract: We can enhance the performance and efficiency of deflection-routed FPGA overlay NoCs by exploiting the cascading feature of the Xilinx UltraScale BlockRAMs. This allows us to (1) harden the multiplexers in the NoC switch crossbars, and (2) efficiently add buffering support to deflection routing. While buffering is not required for correct operation of a deflection-routed NoC, it can boost network throughput for large system sizes under heavy load and enable functional support for fixed-length, multi-flit NoC traffic. Since the multiplexer controls of the cascaded RAMs can be driven from user logic, the NoC routing function can be implemented in LUTs while the data is steered across the dedicated cascade multiplexers and links. Thus, our approach uses hard resources in the BlockRAM architecture to absorb the bulk of the cost of a NoC, in the form of both crossbar multiplexing and packet queuing. For the XCVU9P UltraScale+ FPGA, we show how to map the 72b Hoplite NoC router at a cost of 3 FIFO blocks, 64 LUTs, and 40 FFs per switch while operating at ≈727 MHz (400 MHz in a 60×12 grid). This reduces LUT count by 1.4× and FF cost by 2× over a pure LUT-based implementation while also being 1.2× faster. For uniform RANDOM traffic, we boost the throughput of a 16×16 NoC by 50–60%, reduce worst-case packet latency by ≈40%, and lower energy use by 10–40% over classic bufferless deflection routing at injection rates of 15–20% and higher with 16-deep buffers. When compared to hard NoC router designs, our BRAM-based soft NoC also closes the area gap to under a factor of two, instead of the 20–23× gap claimed in earlier studies.
Citations: 8
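Bufferless deflection routing of the kind Hoplite uses can be sketched as follows. This is a heavily simplified behavioural model (two input ports, fixed W-over-N priority, torus wrap links not modeled), assumed for illustration rather than taken from the paper's switch design:

```python
def desired_port(pkt, x, y):
    """Dimension-ordered preference at switch (x, y): X ring, then Y ring."""
    dx, dy = pkt                 # packet = destination coordinates
    if dx != x:
        return "E"               # keep moving along the x ring
    if dy != y:
        return "S"               # then along the y ring
    return "EJECT"               # arrived at the destination switch

def route_two(pkt_w, pkt_n, x, y):
    """One switch cycle: the W-input packet wins; on a conflict the
    N-input packet is deflected to the other output rather than stalled,
    so no buffering is ever required for correctness."""
    w_want = desired_port(pkt_w, x, y)
    n_want = desired_port(pkt_n, x, y)
    out = {w_want: pkt_w}
    if n_want in out:
        n_want = "S" if n_want == "E" else "E"   # deflect, never stall
    out[n_want] = pkt_n
    return out
```

The paper's contribution is to fold the output multiplexing (and optional queues) of exactly this kind of switch into the BlockRAM cascade hardware instead of LUTs.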
Automata-to-Routing: An Open-Source Toolchain for Design-Space Exploration of Spatial Automata Processing Architectures
J. Wadden, S. Khan, K. Skadron
DOI: 10.1109/FCCM.2017.38
Abstract: Newly available spatial architectures that accelerate finite-automata processing have spurred research and development on novel automata-based applications. However, research on spatial automata processing architectures is lacking, owing to a lack of automata optimization and place-and-route tools. To address this, we propose a new open-source toolchain, Automata-to-Routing (ATR), that enables design-space exploration of spatial automata architectures. ATR leverages existing open-source tools for both automata processing and FPGA architecture research. To demonstrate the usefulness of this new toolchain, we use it to analyze design choices of spatial automata processing architectures. We first show that ATR can model the logic tiles of a commercially available spatial automata processing architecture. We then use ATR to compare and contrast two different routing architecture methodologies, hierarchical and 2D-mesh, over a set of diverse automata benchmarks. We show that shallower 2D-mesh-style routing fabrics can route complex automata with equal channel width while using up to 4.2x fewer logic tile resources.
Citations: 11
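The execution model that spatial automata architectures implement is the homogeneous automaton: every state owns a character class, and on each input symbol the whole active set advances in lockstep. A minimal sketch (the two-state "a+b" pattern and the always-active start semantics are illustrative assumptions):

```python
# states: id -> (matching character class, successor ids, reports on match)
STATES = {
    0: (set("a"), {0, 1}, False),   # one or more 'a'
    1: (set("b"), set(), True),     # then a single 'b' -> report
}
START = {0}

def run(symbols):
    """Return the input offsets at which a reporting state matched."""
    active, reports = set(START), []
    for i, ch in enumerate(symbols):
        nxt = set()
        for s in active:
            cls, succ, report = STATES[s]
            if ch in cls:           # in hardware, all states match in parallel
                nxt |= succ
                if report:
                    reports.append(i)
        active = nxt | START        # start states re-arm every cycle
    return reports
```

Place-and-route for such architectures (ATR's job) is essentially mapping the `succ` edges onto a physical routing fabric with bounded channel width.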
Efficient GPGPU Computing with Cross-Core Resource Sharing and Core Reconfiguration
Ashutosh Dhar, Deming Chen
DOI: 10.1109/FCCM.2017.59
Abstract: GPUs are capable of running a variety of applications, but their generic parallel architecture can lead to inefficient use of resources and reduced power efficiency due to algorithmic or architectural constraints. In this work, taking inspiration from CGRAs (coarse-grained reconfigurable architectures), we demonstrate resource sharing and redistribution as a solution that can be leveraged by reconfiguring the GPU on a kernel-by-kernel basis. We explore four different schemes that trade the number of active SMs (streaming multiprocessors) for increased occupancy and local memory resources per SM, and demonstrate improved power and energy with limited impact on performance. Our most aggressive scheme, BigSM, saves up to 54% energy, and 26% on average.
Citations: 2