P. Meloni, Gianfranco Deriu, Francesco Conti, Igor Loi, L. Raffo, L. Benini
{"title":"A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC","authors":"P. Meloni, Gianfranco Deriu, Francesco Conti, Igor Loi, L. Raffo, L. Benini","doi":"10.1109/ReConFig.2016.7857144","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857144","url":null,"abstract":"Convolutional Neural Networks (CNNs) are a nature-inspired model, extensively employed in a broad range of applications in computer vision, machine learning and pattern recognition. The CNN algorithm requires execution of multiple layers, commonly called convolution layers, that involve application of 2D convolution filters of different sizes over a set of input image features. Such a computation kernel is intrinsically parallel, thus significantly benefits from acceleration on parallel hardware. In this work, we propose an accelerator architecture, suitable to be implemented on mid-to high-range FPGA devices, that can be re-configured at runtime to adapt to different filter sizes in different convolution layers. We present an accelerator configuration, mapped on a Xilinx Zynq XC-Z7045 device, that achieves up to 120 GMAC/s (16 bit precision) when executing 5×5 filters and up to 129 GMAC/s when executing 3×3 filters, consuming less than 10W of power, reaching more than 97% DSP resource utilizazion at 150MHz operating frequency and requiring only 16B/cycle I/O bandwidth.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125026773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel and efficient method to initialize FPGA embedded memory content in asymptotically constant time","authors":"Matěj Bartík, S. Ubik, P. Kubalík","doi":"10.1109/ReConFig.2016.7857146","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857146","url":null,"abstract":"This paper describes analysis and implementation of a new method for maintaining valid content of FPGA memory blocks with an asymptotically constant time synchronous clear ability, that can be useful for (re)initialization to one default value. A particular application can be for high-speed real-time LZ77 [1] lossless compression algorithms, where a dictionary has to be (re)initialized before each run of the implemented compression algorithm. The method is based on two most widely used techniques for clearing the memory content: a linear passage of the memory and clearing each cell by writing a default value and creating a register field providing an (in)valid bit for each memory cell. Our solution combines these two techniques together with the use of FPGA distributed memory blocks implemented in LUTs (Look-Up Tables) to overcome negative features of each previous method without losing the most of positive features. Our solution provides a balance between the two previous techniques and exceeds them in speed, resources utilization and latency of (re)initialization.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129527137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA-based design for joint control and monitoring of permanent magnet synchronous motors","authors":"Paul Rogers, R. Kavasseri, Scott C. Smith","doi":"10.1109/ReConFig.2016.7857152","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857152","url":null,"abstract":"We present an FPGA-based design approach that will allow motor control and on-board condition monitoring to be achieved in parallel using the same set of physical variables. The key idea lies in exploiting the parallelism of FPGAs to achieve the joint objectives of control and monitoring using common physical variables, namely motor phase currents. For illustration, a permanent magnet synchronous machine (PMSM) governed by Field Oriented Control (FOC) and health monitoring using Motor Current Signature Analysis (MCSA) is considered. Since FOC is computationally intensive, the control algorithm is optimized for speed using distributed pipelining and an FFT-core is utilized for MCSA. The design stage uses MATLAB/Simulink for the reference control coupled with HDL coder for VHDL generation. The synthesis and timing analysis are done with Altera's Quartus II. The tool-chain allows easy analysis and optimization of model-based motor control algorithms. The results show that the joint objectives can be obtained with decreased current and torque ripple at the expense of modest increases in resource utilization.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129214969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-level synthesis of a genomic database search engine","authors":"Rasha Karakchi, Jordan A. Bradshaw, J. Bakos","doi":"10.1109/ReConFig.2016.7857174","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857174","url":null,"abstract":"Genomic database search is an I/O-bound problem, so avoiding unnecessary I/O transactions is a key consideration for improving search throughput. Many approximate search tools such as NCBI BLAST perform a database scan for each query, lacking a mechanism to avoid access to portions of the database that offer no potential for a match. In this paper we present an approach for using an FPGA-based pattern filter to convert each search query into a set of potential database matches that reduces the average portion of the database accessed per query. The approach is based on a hardware design for a pattern filter that can achieve a sustained recognition rate of one pattern per cycle. We used Vivado HLS to design the filter. Despite the presence of loop-carried dependencies, the final design meets the maximum possible throughout as constrained by the code's arithmetic intensity and available memory bandwidth. In this paper we describe the filter implementation and our code tuning methodology.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126343622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thomas B. Preußer, M. Zabel, Patrick Lehmann, R. Spallek
{"title":"The portable open-source IP core and utility library PoC","authors":"Thomas B. Preußer, M. Zabel, Patrick Lehmann, R. Spallek","doi":"10.1109/ReConFig.2016.7857191","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857191","url":null,"abstract":"Standard libraries and frameworks boost the productivity and performance significantly as they enable the re-use of optimized solutions for standard tasks. Hardware designs are often unnecessarily complex because a) a rich RTL library of standard solutions is missing and b) designs must often sacrifice portable and readable behavioral descriptions so as to meet timing and area constraints on the targeted device. The PoC Library addresses these issues. First of all, it provides abstracted solutions for standard tasks. These include single- and dual-port memory components as well as higher-level data structures such as FIFOs, stacks and deques built on top of them. The library further comprises cross-clock triggers, arithmetic and algorithmic cores, as for wide addition and sorting, as well as communication stack implementations. Each implementation is encapsulated by a stable interface that is independent from the specific target platform. Nonetheless, device-specific optimizations are available through specialized implementations, which are selected internally whenever this is beneficial or necessitated by the vendor flow. The provided modules are highly parametrizable to fit the application needs and enable design space exploration. An extensive set of utility functions and frequently used data types benefits the conciseness of both library and user code. Finally, PoC enables the continuous verification of its IP cores by automated testbenches. This verification flow is only one part of a flow infrastructure that also supports the generation of re-usable netlists as to speed up the integration of more complex cores into an application design. The flow infrastructure is implemented in Python and supports various simulation backends, synthesis tool chains and operating systems.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134422999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic framework to generate reconfigurable accelerators for option pricing applications","authors":"Pham Nam Khanh, Khin Mi Mi Aung, Akash Kumar","doi":"10.1109/ReConFig.2016.7857157","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857157","url":null,"abstract":"Option Pricing is a fundamental application in most financial institutions dealing with derivative market. It frequently requires huge computational effort and low latency demand. Therefore, a number of different Option Pricing implementations have been developed on FPGA-based platform. However, none of the existing works cover more than one models or different types of options, which yields problem of productively implementing several hardware accelerators for different models. To fill in the gap, we propose a design flow for generating efficient hardware accelerators for option pricing applications with different models and option types. The framework boosts the designers productivity and enables quick prototyping on FPGA platform by providing general template architecture for option pricing applications. The architecture comes along with a prebuilt design library, which covers a wide range of popular financial models. Experimental results for four models show that the accelerators generated from our design flow outperform their counterpart software implementation with two order of magnitude speedup. While comparing with existing hardware designs for the same models, our framework can produce the accelerators that overcome most of manual designed engines.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"693 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113994405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Survey on and re-evaluation of wide adder architectures on FPGAs","authors":"Thomas B. Preußer, Markus Krause","doi":"10.1109/ReConFig.2016.7857189","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857189","url":null,"abstract":"The binary word addition was one of the earliest operations that called for special-purpose hardware structures on otherwise freely programmable logic devices. The large logic depth induced by the great fanin that comprises both operands of the addition is especially harmful in SRAM-programmed FPGAs where the delays of the configurable inter-LUT routing are expensive in comparison to the delays of the connected logic stages. These costs have been addressed by carry chains that establish direct links through a linear series of configurable logic blocks on virtually all modern FPGA devices. This structure is so effective that it puts a simple linear adder layout at a great advantage. Although it must eventually recede behind more sophisticated hierarchical adder structures with logarithmic delays, the actual turning point has been pushed beyond operand widths of 50 bits or more. Thus, many designs can simply rely on the default addition that is so well supported directly by the hardware. This changes in the context of extraordinarily wide operands as they are often found in cryptographic applications. They require designers to identify an appropriate wide adder implementation that is able to meet their design goals. The typical bottleneck imposed by the wide fanin of addition is the achievable clock rate. Various authors have analyzed the performance of the classic fast addition schemes and proposed adder architectures that genuinely blend classic hierarchical approaches with the capabilities of the fast carry chains. This paper presents a survey across these proposals and re-evaluates them in the context of modern FPGA devices.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129932904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection and Isolation of permanent faults in FPGAs with remote access","authors":"F. Rittner, R. Glein, A. Heuberger","doi":"10.1109/ReConFig.2016.7857165","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857165","url":null,"abstract":"In this paper, we present a Fault Detection (FD) and Fault Isolation (FI) concept for permanent faults in SRAM-based FPGAs. Harsh environments, such as space, cause permanent faults, which results in defective hardware. Furthermore, physical inaccessible systems in such environments limit debugging capabilities. The proposed concept uses wireless remote debugging to access the remote device. The presumption of a permanent fault starts an extended hardware debugging procedure by performing off-line tests without user application. First, the FD process confirms the fault as permanent and reject temporary faults (known as static and transient). Afterwards, a fine-grain permanent FD and FI determines affected primitives in the FPGA and exclude this primitive from the FPGA design. This is realized with a customized recovery strategy for each primitive type, by blocking and bypassing these defective primitives. Focus of this paper is the feasibility of the concept.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125650535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"REoN: A protocol for reliable software-defined FPGA partial reconfiguration over network","authors":"Vaibhawa Mishra, Qianqiao Chen, G. Zervas","doi":"10.1109/ReConFig.2016.7857184","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857184","url":null,"abstract":"This paper presents and defines a Reconfiguration over Network (REoN) protocol. It is a solution for a FPGA-based dynamically reconfigurable system, that offers partial (re)programming over the network without the need of a local/embedded soft/hard processor. This protocol can transport partial bit files from centralized control and management system via network resource management API to a FPGA empowered network node, using standard 10 Gbps Ethernet. This work architects and introduces a proprietary lightweight connection oriented protocol stack, which guarantees reliability over standard UDP/IP protocol. Hardware stack for standard networking protocols including remote reconfiguration engine directly interfaced with Xilinx Internal Configuration Access Port (ICAP). This minimizes FPGA resource requirements in re-programming the FPGA. The presented work is an enabling technology for a range of applications such as reconfigurable computing enabled Network Function Virtualization (NFV), function dis aggregation on data centres empowered by FPGA/SoCs, as well as Internet of Things (IoT).","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126078391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mario Ruiz, G. Sutter, S. López-Buedo, J. D. Vergara
{"title":"FPGA-based encrypted network traffic identification at 100 Gbit/s","authors":"Mario Ruiz, G. Sutter, S. López-Buedo, J. D. Vergara","doi":"10.1109/ReConFig.2016.7857172","DOIUrl":"https://doi.org/10.1109/ReConFig.2016.7857172","url":null,"abstract":"Network traffic monitoring is becoming increasingly hard to manage due to the ever-growing speed of network links. At 100 Gbit/s, the huge volume of data makes it very difficult to perform online analyses or to store traffic for subsequent forensic investigations. It is therefore mandatory to carry out some kind of filtering and/or capping in the network traffic to be analyzed. Additionally, the fraction of encrypted traffic is relentlessly increasing. For such encrypted traffic, storing the payload is most times useless. In this paper we present an FPGA implementation of a method to identify plain text (that is, human readable) in the network packet payload. The method is based on both detecting bursts of printable ASCII characters and calculating the fraction of these printable characters in the packet payload. This method has proven to be very effective in reducing the amount of information used in traffic analysis, by saving only the headers of packets with encrypted payloads. We leveraged the advantages of high-level languages to reduce development time, though traditional HDL languages were also used to optimize critical areas of the design. The design targets the 100 Gbit/s Ethernet interfaces of Xilinx Virtex UltraScale devices and it is able to detect human-readable packet payloads at line rate, with a high accuracy.","PeriodicalId":431909,"journal":{"name":"2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128775252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}