Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Papers (Page 6)

A Batch Normalization Free Binarized Convolutional Deep Neural Network on an FPGA (Abstract Only)

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021782

Hiroki Nakahara, H. Yonekawa, H. Iwamoto, M. Motomura

Abstract: A pre-trained convolutional deep neural network (CNN) is, from a computational perspective, a feed-forward network. CNNs are widely used in embedded systems, which demand high power and area efficiency. This paper realizes a binarized CNN that handles only binary values (+1/-1) for both inputs and weights. In this case, each multiplier is replaced by an XNOR circuit rather than a dedicated DSP block, which makes binarized inputs and weights better suited to hardware implementation. However, a binarized CNN requires batch normalization to retain classification accuracy. The additional multiplications and additions then require extra hardware, and the memory accesses for the batch normalization parameters reduce system performance. In this paper, we propose a batch-normalization-free CNN that is mathematically equivalent to a CNN using batch normalization. The proposed CNN operates on binarized inputs and weights with an integer bias. We implemented the VGG-16 benchmark CNN on the NetFPGA-SUME FPGA board, which has a Xilinx Inc. Virtex-7 FPGA and three off-chip QDR II+ synchronous SRAMs. Compared with conventional FPGA realizations, although the classification error rate degrades by 6.5%, the performance is 2.82 times faster, the power efficiency is 1.76 times lower, and the area efficiency is 11.03 times smaller. Thus, our method is suitable for embedded computer systems.
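
The two ideas in this abstract, replacing multiplication by XNOR and folding batch normalization into an integer bias, can be illustrated with a short Python sketch. This is not the authors' derivation or hardware design: the function names, the bit encoding (+1 as bit 1, -1 as bit 0), the restriction to a positive scale factor `gamma`, and the sample parameter values are all assumptions made for illustration.

```python
import math

def encode(values):
    """Pack a list of +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors computed from their bit encodings."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: bit is 1 where the two signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # (#matches) - (#mismatches)

def sign(y):
    """Binarizing activation: +1 for y >= 0, otherwise -1."""
    return 1 if y >= 0 else -1

def batch_norm_sign(x, mu, var, gamma, beta, eps=0.0):
    """Reference path: batch normalization followed by the sign activation."""
    return sign(gamma * (x - mu) / math.sqrt(var + eps) + beta)

def folded_integer_bias(mu, var, gamma, beta, eps=0.0):
    """Integer b with sign(x + b) == batch_norm_sign(x, ...) for every integer x.

    Assumes gamma > 0. Dividing the batch-norm output by gamma / sigma keeps its
    sign, leaving x + (beta * sigma / gamma - mu); because x is an integer, the
    real-valued offset can be floored without changing the comparison.
    """
    if gamma <= 0:
        raise ValueError("this sketch only covers gamma > 0")
    return math.floor(beta * math.sqrt(var + eps) / gamma - mu)

# Hypothetical 8-element input and weight vectors.
inputs = [1, -1, 1, 1, -1, -1, 1, -1]
weights = [1, 1, -1, 1, -1, 1, 1, -1]
dot = xnor_popcount_dot(encode(inputs), encode(weights), len(inputs))

# Hypothetical batch-normalization parameters for one neuron.
bias = folded_integer_bias(mu=1.3, var=4.0, gamma=0.5, beta=0.2)
```

Under these assumptions the neuron output is `sign(dot + bias)`, which needs no multiplier and no per-neuron floating-point parameters at inference time.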

Citations: 15

Session details: Virtualization and Applications

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3257191

J. Lockwood

Citations: 0

Session details: Special Session: The Role of FPGAs in Deep Learning

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3257183

A. Ling

Citations: 0

Session details: Architecture

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3257186

S. Wilton

Citations: 0

Precise Coincidence Detection on FPGAs: Three Case Studies (Abstract Only)

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021766

R. Salomon, R. Joost

Citations: 0

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021737

Jialiang Zhang, Soroosh Khoram, J. Li

Abstract: Large graph processing has gained great attention in recent years due to its broad applicability from machine learning to social science. Large real-world graphs, however, are inherently difficult to process efficiently, not only because of their large memory footprint but also because most graph algorithms entail memory access patterns with poor locality and a low compute-to-memory access ratio. In this work, we leverage the exceptional random access performance of emerging Hybrid Memory Cube (HMC) technology, which stacks multiple DRAM dies on top of a logic layer, combined with the flexibility and efficiency of FPGAs to address these challenges. To the best of our knowledge, this is the first work that implements a graph processing system on an FPGA-HMC platform based on software/hardware co-design and co-optimization. We first present modifications to the algorithm and a platform-aware graph processing architecture to perform level-synchronized breadth first search (BFS) on an FPGA-HMC platform. To gain better insight into the potential bottlenecks of the proposed implementation, we develop an analytical performance model to quantitatively evaluate the HMC access latency and the corresponding BFS performance. Based on the analysis, we propose a two-level bitmap scheme to further reduce memory accesses and optimize key design parameters (e.g. memory access granularity). Finally, we evaluate the performance of our BFS implementation using the AC-510 development kit from Micron. We achieved 166 million edges traversed per second (MTEPS) using the GRAPH500 benchmark on a random graph with a scale of 25 and an edge factor of 16, which significantly outperforms CPU and other FPGA-based large graph processors.
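
Level-synchronized BFS with a bitmap hierarchy can be sketched in software as follows. The abstract does not describe the two-level bitmap in detail, so this Python sketch shows only one plausible reading: a fine-grained frontier bitmap plus a coarse summary bitmap that lets the scan skip blocks with no frontier vertices. The function name, the `block_bits` parameter, and the use of Python integers as bitmaps are assumptions, not the authors' architecture.

```python
def bfs_level_sync(adj, source, block_bits=8):
    """Level-synchronous BFS over an adjacency list; returns each vertex's level.

    Unreached vertices keep level -1. Bitmaps are stored in Python integers.
    """
    n = len(adj)
    n_blocks = (n + block_bits - 1) // block_bits
    level = [-1] * n
    level[source] = 0
    visited = 1 << source                     # fine-grained visited bitmap
    frontier = 1 << source                    # fine-grained frontier bitmap
    summary = 1 << (source // block_bits)     # coarse bitmap: one bit per block
    depth = 0
    while frontier:
        next_frontier = 0
        next_summary = 0
        for blk in range(n_blocks):
            if not (summary >> blk) & 1:
                continue                      # whole block is empty: skip the fine scan
            base = blk * block_bits
            for v in range(base, min(base + block_bits, n)):
                if not (frontier >> v) & 1:
                    continue
                for w in adj[v]:
                    if not (visited >> w) & 1:
                        visited |= 1 << w
                        level[w] = depth + 1
                        next_frontier |= 1 << w
                        next_summary |= 1 << (w // block_bits)
        # Barrier between levels: the next frontier replaces the current one.
        frontier, summary = next_frontier, next_summary
        depth += 1
    return level

# Small hypothetical directed graph; vertex 4 is unreachable from vertex 0.
graph = [[1, 2], [3], [3], [], [0]]
levels = bfs_level_sync(graph, source=0)
```

For scale, the Graph500 convention gives a scale-25, edge-factor-16 graph 2^25 vertices and 16 × 2^25 = 2^29 (about 537 million) edges, so 166 MTEPS corresponds to roughly 3.2 seconds per search if every edge is traversed once. That figure is a back-of-the-envelope estimate, not a number reported in the abstract.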

Citations: 56

GRT 2.0: An FPGA-based SDR Platform for Cognitive Radio Networks (Abstract Only)

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021798

Haoyang Wu, Tao Wang, Zhiwei Li, Boyan Ding, Xiaoguang Li, Tianfu Jiang, Jun Liu, Songwu Lu

Abstract: Although theoretical research on cognitive radio is growing explosively, real-time platforms for cognitive radio are progressing slowly. Researchers expect to prototype their designs quickly on appropriate wireless platforms in order to evaluate and validate them precisely. Platforms for cognitive radio should provide both high performance and programmability. We observe that, owing to their parallel and reconfigurable nature, FPGAs are well suited to developing real-time software-defined radio (SDR) platforms. However, without a carefully designed "middleware architecture layer", a real-time programmable wireless system is still difficult to build. In this paper, we present GRT 2.0, a novel high-performance and programmable SDR platform for cognitive radio. This paper focuses on the architecture design of the media access control (MAC) layer and the radio frequency (RF) front-end interface. We allocate different MAC functions to different computing units, including a dedicated, lightweight embedded processor and several peripherals, to ensure both programmability and microsecond-level timing requirements. A serial-to-parallel converter is adopted to solve the issues of frame type matching and precise timing between PHY and RF. To support mobile host computers, we use the more portable USB 3.0 interface instead of PCIe. Finally, with the design of an efficient "gain lock" state machine, automatic gain control (AGC) processing time has been reduced to less than 1 µs. The evaluation results show that with the 802.11a/g protocol, GRT 2.0 achieves a maximum MAC throughput of 23 Mbps, which is comparable to commodity fixed-logic wireless network adaptors. The latency of the RF front-end is less than 2 µs, more than a 10x improvement over the Ethernet cable interface. Moreover, through a carefully designed "middleware architecture layer" in the FPGA, we provide good programmability in both MAC and PHY.
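
The abstract names a "gain lock" state machine for AGC but does not describe its states or thresholds. The Python sketch below is therefore a generic illustration of the idea that the gain is corrected once and then frozen for the rest of a frame; it is not GRT 2.0's design. The state names, the dB thresholds, and the one-shot correction rule are all hypothetical.

```python
from enum import Enum

class State(Enum):
    IDLE = 0      # no signal detected
    ADJUST = 1    # signal present, gain still being corrected
    LOCKED = 2    # gain frozen until the signal disappears

# Hypothetical parameters, in dB relative to full scale.
DETECT_DB = -60.0                 # input power above this counts as a signal
TARGET_DB = -12.0                 # desired power after the gain stage
WINDOW_DB = 1.0                   # acceptable error around the target
MIN_GAIN_DB, MAX_GAIN_DB = 0.0, 50.0

def agc_step(state, gain_db, in_power_db):
    """Advance the state machine by one power measurement."""
    if in_power_db < DETECT_DB:
        return State.IDLE, gain_db            # signal gone: release the lock
    if state is State.LOCKED:
        return State.LOCKED, gain_db          # locked: ignore power fluctuations
    error = TARGET_DB - (in_power_db + gain_db)
    if abs(error) <= WINDOW_DB:
        return State.LOCKED, gain_db          # close enough: freeze the gain
    new_gain = min(MAX_GAIN_DB, max(MIN_GAIN_DB, gain_db + error))
    return State.ADJUST, new_gain             # one-shot correction, then re-check

def run_agc(powers_db, gain_db=20.0):
    """Feed a sequence of input power readings through the state machine."""
    state, trace = State.IDLE, []
    for power in powers_db:
        state, gain_db = agc_step(state, gain_db, power)
        trace.append((state, gain_db))
    return trace

# Silence, a frame arriving at -30 dB that later fades to -35 dB, then silence.
trace = run_agc([-80.0, -30.0, -30.0, -35.0, -80.0])
```

In this sketch the lock is reached after a single correction step, and the fade to -35 dB does not disturb the gain once it is locked.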

Citations: 0

FPGA-based Hardware Accelerator for Image Reconstruction in Magnetic Resonance Imaging (Abstract Only)

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021793

Emanuele Pezzotti, A. Iacobucci, G. Nash, Umer I. Cheema, Paolo Vinella, R. Ansari

Abstract: Magnetic Resonance Imaging (MRI) is widely used in medical diagnostics. Sampling of MRI data on Cartesian grids allows efficient computation of the Inverse Discrete Fourier Transform for image reconstruction using the Inverse Fast Fourier Transform (IFFT) algorithm. Though the use of Cartesian trajectories simplifies the IFFT computation, non-Cartesian trajectories have been shown to provide better image resolution with lower scan times. Dedicated accelerators are required to improve the processing time of MRI image reconstruction for these optimized non-Cartesian trajectories using a Non-uniform Fast Fourier Transform (NuFFT) algorithm. We present an FPGA-based MRI solution that implements NuFFT for image reconstruction. The solution is based on the design of an efficient custom accelerator on FPGA using OpenCL, and covers all the phases necessary to reconstruct an image with high accuracy, starting from raw scan data. The architecture can easily be extended to 3D imaging, and k-space properties have been analyzed to reduce the number of samples processed, achieving satisfactory reconstruction accuracy while positively impacting processing time. Our solution achieves a marked improvement over previously published FPGA- and CPU-based implementations and, due to its scalability, is suitable for the image sizes common in MRI acquisitions.
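
To make the Cartesian versus non-Cartesian distinction concrete, the Python sketch below evaluates the non-uniform discrete Fourier transform directly. It is a 1D reference computation with O(N·M) cost, the sum that NuFFT algorithms approximate more cheaply; it is not the authors' OpenCL accelerator. The function names and the sample data are assumptions.

```python
import cmath

def nudft_forward(image, k_positions):
    """k-space samples of a 1D image at arbitrary (possibly non-integer) frequencies."""
    n = len(image)
    return [sum(image[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in k_positions]

def nudft_adjoint(samples, k_positions, n):
    """Adjoint transform back to n pixels, scaled by the number of samples."""
    m = len(samples)
    return [sum(s * cmath.exp(2j * cmath.pi * k * j / n)
                for s, k in zip(samples, k_positions)) / m
            for j in range(n)]

# Hypothetical 8-pixel image.
image = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0, 1.5]
n = len(image)

# Cartesian sampling: integer frequencies 0..n-1, so the adjoint is the exact inverse DFT.
cartesian = list(range(n))
recovered = nudft_adjoint(nudft_forward(image, cartesian), cartesian, n)

# Non-Cartesian sampling: jittered frequencies; the plain adjoint is only approximate.
jittered = [k + 0.2 * (-1) ** k for k in cartesian]
approx = nudft_adjoint(nudft_forward(image, jittered), jittered, n)
```

On the Cartesian grid the round trip recovers the image to floating-point precision. On the jittered grid the plain adjoint is no longer an exact inverse, which is why practical non-Cartesian reconstruction adds steps such as density compensation and gridding before an FFT.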

Citations: 2

Hardware Synthesis of Weakly Consistent C Concurrency

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021733

Nadesh Ramanathan, Shane T. Fleming, John Wickerson, G. Constantinides

Abstract: Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations ('atomics'), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This paper explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. We implement our approach in the open-source LegUp HLS framework, and provide both sequentially consistent (SC) and weakly consistent ('weak') atomics. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many concurrent algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics defined by the 2011 revision of the C standard. A case study on a circular buffer suggests that circuits synthesised from programs that use atomics can be 2.5x faster than those that use locks, and that weak atomics can yield a further 1.5x speedup.
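
The claim that a reordering can be sound for one thread yet wrong for two can be checked by brute force on the classic message-passing pattern. The Python sketch below enumerates every interleaving of a writer (store `data`, then store `flag`) and a reader (load `flag`, then load `data`). It is a toy model of program order only, not the paper's scheduler, its model checker, or the C11 memory model; the operation encoding and all names are assumptions.

```python
from itertools import combinations

def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's own order."""
    total = len(a) + len(b)
    for a_slots in combinations(range(total), len(a)):
        slots = set(a_slots)
        a_iter, b_iter = iter(a), iter(b)
        yield [next(a_iter) if i in slots else next(b_iter) for i in range(total)]

def outcomes(writer, reader):
    """Set of (r_flag, r_data) values the reader can observe over all interleavings."""
    results = set()
    for schedule in interleavings(writer, reader):
        memory = {"data": 0, "flag": 0}
        registers = {}
        for op, location, operand in schedule:
            if op == "store":
                memory[location] = operand             # operand is the value stored
            else:
                registers[operand] = memory[location]  # operand is the register name
        results.add((registers["r_flag"], registers["r_data"]))
    return results

WRITER = [("store", "data", 1), ("store", "flag", 1)]
READER = [("load", "flag", "r_flag"), ("load", "data", "r_data")]

# Writer's stores kept in program order.
in_order = outcomes(WRITER, READER)

# Writer's stores swapped: harmless for the writer alone, since the two stores
# touch different addresses, but visible to the concurrent reader.
reordered = outcomes(WRITER[::-1], READER)
```

With the stores in program order, the reader can never see the flag set while the data is still stale. Once the two stores are swapped, that outcome, `(1, 0)`, becomes reachable, which is the kind of behaviour an extra intra-thread ordering constraint rules out.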

Citations: 7

Scala Based FPGA Design Flow (Abstract Only)

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021762

Yanqiang Liu, Yao Li, Weilun Xiong, Meng Lai, Cheng Chen, Zhengwei Qi, Haibing Guan

Abstract: With the rapid growth of data scale, data analysis applications are starting to hit performance bottlenecks and thus require hardware acceleration. At the same time, Field Programmable Gate Arrays (FPGAs), known for their high customizability and parallel nature, have gained momentum in the past decade. However, the development efficiency of FPGA-based acceleration systems is severely constrained by traditional languages and tools, owing to their limited expressiveness and extensibility, their limited libraries, and the semantic gap between software and hardware design. This paper proposes a new open-source DSL-based hardware design framework called VeriScala (https://github.com/VeriScala/VeriScala) that supports highly abstract object-oriented hardware definition, programmatic testing, and interactive on-chip debugging. By adopting a DSL embedded in Scala, we introduce modern software development concepts into hardware design, including object-oriented programming, parameterized types, type safety, and test automation. VeriScala enables designers to describe their hardware designs in Scala, generate Verilog code automatically, and interactively debug and test hardware designs in a real FPGA environment. Through evaluation on real-world applications and a usability test, we show that VeriScala provides a practical approach for rapid prototyping of hardware acceleration systems. (This work is supported by the National Key Research & Development Program of China 2016YFB1000500)
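
The general idea of describing hardware in a host language and emitting Verilog can be shown with a toy embedded DSL. The sketch below is written in Python rather than Scala and does not use or imitate VeriScala's API; the `Signal` and `Module` classes, their methods, and the adder example are all hypothetical.

```python
class Signal:
    """A named wire or expression with a bit width."""
    def __init__(self, name, width):
        self.name, self.width = name, width

    def __add__(self, other):
        # The sum is one bit wider than its widest operand so the carry is kept.
        return Signal(f"({self.name} + {other.name})", max(self.width, other.width) + 1)

class Module:
    """Collects ports and continuous assignments, then emits Verilog text."""
    def __init__(self, name):
        self.name, self.ports, self.assigns = name, [], []

    def input(self, name, width):
        self.ports.append(f"input  wire [{width - 1}:0] {name}")
        return Signal(name, width)

    def output(self, name, expr):
        self.ports.append(f"output wire [{expr.width - 1}:0] {name}")
        self.assigns.append(f"assign {name} = {expr.name};")

    def emit(self):
        ports = ",\n  ".join(self.ports)
        body = "\n  ".join(self.assigns)
        return f"module {self.name} (\n  {ports}\n);\n  {body}\nendmodule"

def build_adder(width):
    """Parameterized generator: the host language supplies the abstraction."""
    m = Module(f"adder{width}")
    a = m.input("a", width)
    b = m.input("b", width)
    m.output("sum", a + b)
    return m

verilog = build_adder(8).emit()
```

Because the generator is ordinary host-language code, the same function yields adders of any width and can be exercised by normal unit tests before any Verilog is synthesized.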

Citations: 4