{"title":"An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs","authors":"Liqiang Lu, Jiaming Xie, Ruirui Huang, Jiansong Zhang, Wei Lin, Yun Liang","doi":"10.1109/FCCM.2019.00013","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00013","url":null,"abstract":"Deep convolutional neural networks (CNN) have achieved remarkable performance with the cost of huge computation. As the CNN model becomes more complex and deeper, compressing CNN to sparse by pruning the redundant connection in networks has emerged as an attractive approach to reduce the amount of computation and memory requirement. In recent years, FPGAs have been demonstrated to be an effective hardware platform to accelerate CNN inference. However, most existing FPGA architectures focus on dense CNN models. The architecture designed for dense CNN models are inefficient when executing sparse models as most of the arithmetic operations involve addition and multiplication with zero operands. On the other hand, recent sparse FPGA accelerators only focus on FC layers. In this work, we aim to develop an FPGA accelerator for sparse CNNs. To efficiently deal with the irregular connection in the sparse convolutional layer, we propose a weight-oriented dataflow that processes each weight individually. Then we design an FPGA architecture which can handle input-weight connection and weight-output connection efficiently. For input-weight connection, we design a tile look-up table to eliminate the runtime indexing match of compressed weights. Moreover, we develop a weight layout to enable high on-chip memory access. To cooperate with the weight layout, a channel multiplexer is inserted to locate the address which can ensure no data access conflict. Experiments demonstrate that our accelerator can achieve 223.4-309.0 GOP/s for the modern CNNs on Xilinx ZCU102, which provides a 3.6x-12.9x speedup over previous dense CNN FPGA accelerators.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122453008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High Throughput and Energy-Efficient Retina-Inspired Tone Mapping Processor","authors":"Lili Liu, Xiaoqiang Xiang, Yuxiang Xie, Yongjie Li, Bo Yan, Jun Zhou","doi":"10.1109/FCCM.2019.00062","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00062","url":null,"abstract":"This paper presents a high throughput and energy-efficient retina inspired tone mapping processor. Several hardware design techniques have been proposed to achieve high throughput and high energy efficiency, including data partition based parallel processing with S-shape sliding, adjacent frame feature sharing, multi-layer convolution pipelining and convolution filter compression with zero skipping convolution. The proposed processor has been implemented on a Xilinx's Virtex7 FPGA for demonstration. It is able to achieve a throughput of 189 frames per second for 1024*768 RGB images with 819 mW. Compared with several state-of-the-art tone mapping processors, the proposed processor achieves higher throughput and energy efficiency. It is suitable for high-speed and energy-constrained video enhancement applications such as autonomous vehicle and drone monitoring.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127004166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Tool and Runtime Support for Fine-Grain Reconfiguration in Highly Flexible Reconfigurable Systems","authors":"Rafael Zamacola, A. García-Martínez, J. Mora, A. Otero, E. D. L. Torre","doi":"10.1109/FCCM.2019.00048","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00048","url":null,"abstract":"Dynamic partial reconfiguration significantly reduces reconfiguration times when offloading a partial design. However, there are occasions when fine-tuning a circuit would greatly benefit from quicker reconfiguration times. To that end, authors present an automated tool and runtime support to reconfigure LUT-based multiplexers and constants. In contrast to conventional multiplexers and constants, it is possible to modify these components without having a direct communication with the static system.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132991746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SimAcc: A Configurable Cycle-Accurate Simulator for Customized Accelerators on CPU-FPGAs SoCs","authors":"Konstantinos Iordanou, Oscar Palomar, John Mawer, Cosmin Gorgovan, A. Nisbet, M. Luján","doi":"10.1109/FCCM.2019.00031","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00031","url":null,"abstract":"This paper describes a flexible infrastructure for fast computer architecture simulation and prototyping of accelerator IP. A trend for System-on-Chips is to include application specific accelerators on the die. However, there is still a key research problem that needs to be addressed: How do hardware accelerators interact with the processors of a system and what is the impact on overall performance? To solve this problem, we propose an infrastructure that can directly simulate unmodified application executables with FPGA hardware accelerators. Unmodified application binaries are dynamically instrumented to generate processor load/store and program counter events and any memory accesses generated by accelerators, that are sent to an FPGA-based out-of-order pipeline model. The key features of our infrastructure are the ability to code exclusively at the user level, to dynamically discover and use available hardware models at run time, to test and simultaneously optimize hardware accelerators in an heterogeneous system. In terms of evaluation, we present a comparison between our system and Gem5 to demonstrate accuracy and relative performance, using the SPEC CPU benchmarks; even though our system is implemented on Zynq XC7045 which integrates dual 667MHz Arm Cortex-A9s with substantial FPGA resources, it outperforms Gem5 running on a Xeon E3 3.2 GHz with 32GBs of RAM. We also evaluate our infrastructure in simulating the interaction of accelerators with processors using accelerators taken from the Mach Benchmark Suite and other custom accelerators from computer vision applications.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128680842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EFCAD — An Embedded FPGA CAD Tool Flow for Enabling On-chip Self-Compilation","authors":"K. Pham, Malte Vesper, Dirk Koch, Eddie Hung","doi":"10.1109/FCCM.2019.00011","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00011","url":null,"abstract":"This paper combines a chain of academic tools to form an FPGA compilation flow for building partially reconfigurable modules on lightweight embedded platforms. Our flow — EFCAD — supports the entire stack from RTL (Verilog) to (partial) bitstream, and we demonstrate early results from the onchip ARM processor of, and targeting, the latest 16nm generation of a Zynq UltraScale+ MPSoC device. With this, we complement Xilinx's PYNQ initiative to not only facilitate System-on-Chip research and education entirely within an embedded system, but also to allow building new and specialising existing customcomputing accelerators without needing access to a workstation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130747768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Irregular Memory Parallelism in Quasi-Stencils through Nonlinear Transformation","authors":"Juan Escobedo, Mingjie Lin","doi":"10.1109/FCCM.2019.00039","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00039","url":null,"abstract":"Non-stencil kernels with irregular memory accesses pose unique challenges to achieving high computing performance and hardware efficiency in high-level synthesis (HLS) of FPGA. We present a highly versatile and systematic approach to effectively synthesizing a special and important subset of non-stencil computing kernels, quasi-stencils, which possess the mathematical property that, if studied in a particular kind of high-dimensional space corresponding to the prime factorization space, the distance between the memory accesses during each kernel iteration becomes constant and such an irregular non-stencil can be considered as a stencil. This opens the door to exploiting a vast array of existing memory optimization algorithms, such as memory partitioning/banking and data reuse, originally designed for the standard stencil-based kernel computing, therefore offering totally new opportunity to effectively synthesizing irregular non-stencil kernels. We show the feasibility of our approach implementing our methodology in a KC705 Xilinx FPGA board and tested it with several custom code segments that meet the quasi-stencil requirement vs some of the state-of the art methods in memory partitioning. We achieve significant reduction in partition factor, and perhaps more importantly making it proportional to the number of memory accesses instead of depending on the problem size with the cost of some wasted space.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125560401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Monobit Wideband Receiver with Integrated Dithering in FPGA","authors":"Dan Pritsker, Colman Cheung","doi":"10.1109/FCCM.2019.00073","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00073","url":null,"abstract":"This work presents an innovative and very competitive approach to re-purpose FPGA digital high-speed transceivers to sample wideband analog signals while achieving excellent sampling quality. Such solution can achieve 16+GHz instantaneous bandwidth using existing technology in Stratix-V FPGA family","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126913656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FASE: FPGA Acceleration of Secure Function Evaluation","authors":"S. Hussain, F. Koushanfar","doi":"10.1109/FCCM.2019.00045","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00045","url":null,"abstract":"We present FASE, an FPGA accelerator for Secure Function Evaluation (SFE) by employing the well-known cryptographic protocol named Yao's Garbled Circuit (GC). SFE allows two parties to jointly compute a function on their private data and learn the output without revealing their inputs to each other. FASE is designed to allow cloud servers to provide secure services to a large number of clients in parallel while preserving the privacy of the data from both sides. Current SFE accelerators either target specific applications, and therefore are not amenable to generic use, or have low throughput due to inefficient management of resources. In this work, we present a pipelined architecture along with an efficient scheduling scheme to ensure optimal usage of the available resources. The scheme is built around a simulator of the hardware design that schedules the workload and assigns the most suitable task to the encryption cores at each cycle. This, coupled with optimal management of the read and write cycles of the Block RAM on FPGA, results in a minimum 2 orders of magnitude improvement in terms of throughput per core for the reported benchmarks compared to the most recent generic GC accelerator. Moreover, our encryption core requires 17% less resource compared to the most recent secure GC realization.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"271 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115819131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Packet Inspection in FPGAs via Approximate Nondeterministic Automata","authors":"Milan Ceska, Vojtěch Havlena, L. Holík, J. Korenek, Ondřej Lengál, Denis Matousek, J. Matoušek, Jakub Semric, Tomáš Vojnar","doi":"10.1109/FCCM.2019.00025","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00025","url":null,"abstract":"Deep packet inspection via regular expression (RE) matching is a crucial task of network intrusion detection systems (IDSes), which secure Internet connection against attacks and suspicious network traffic. Monitoring high-speed computer networks (100 Gbps and faster) in a single-box solution demands that the RE matching, traditionally based on finite automata (FAs), is accelerated in hardware. In this paper, we describe a novel FPGA architecture for RE matching that is able to process network traffic beyond 100 Gbps. The key idea is to reduce the required FPGA resources by leveraging approximate nondeterministic FAs (NFAs). The NFAs are compiled into a multi-stage architecture starting with the least precise stage with a high throughput and ending with the most precise stage with a low throughput. To obtain the reduced NFAs, we propose new approximate reduction techniques that take into account the profile of the network traffic. Our experiments showed that using our approach, we were able to perform matching of large sets of REs from SNORT, a popular IDS, on unprecedented network speeds.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"323 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129773658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PIR-DSP: An FPGA DSP Block Architecture for Multi-precision Deep Neural Networks","authors":"Seyedramin Rasoulinezhad, Hao Zhou, Lingli Wang, Philip H. W. Leong","doi":"10.1109/FCCM.2019.00015","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00015","url":null,"abstract":"Quantisation is a key optimisation strategy to improve the performance of floating-point deep neural network (DNN) accelerators. Digital signal processing (DSP) blocks on field-programmable gate arrays are not efficiently utilised when the accelerator precision is much lower than the DSP precision. Through three modifications to Xilinx DSP48E2 DSP blocks, we address this issue for important computations in embedded DNN accelerators, namely the standard, depth-wise, and pointwise convolutional layers. First, we propose a flexible precision, run-time decomposable multiplier architecture for CNN implementations. Second, we propose a significant upgrade to DSPDSP interconnect, providing a semi-2D low precision chaining capability which supports our low-precision multiplier. Finally, we improve data reuse via a register file which can also be configured as FIFO. Compared with the 27 × 18-bit mode in the Xilinx DSP48E2, our Precision, Interconnect, and Reuseoptimised DSP (PIR-DSP) offers a 6× improvement in multiplyaccumulate operations per DSP in the 9 × 9-bit case, 12× for 4 × 4 bits, and 24× for 2 × 2 bits. We estimate that PIR-DSP decreases the run time energy to 31/19/13% of the original value in a 9/4/2-bit MobileNet-v2 DNN implementation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129506334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}