{"title":"Integrating FPGA-based processing elements into a runtime for parallel heterogeneous computing","authors":"David de la Chevallerie, Jens Korinth, A. Koch","doi":"10.1109/FPT.2014.7082807","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082807","url":null,"abstract":"In this work, we present an approach to integrating FPGA-based computing into a heterogeneous computing environment in an embedded systems context, using the X10RT run-time of the X10 language system as a case study. To this end, we present a hardware/software framework for pools of reconfigurable compute elements, and show how high-level synthesis can be employed to generate the actual processing cores. Our framework is sufficiently lean to deliver high performance FPGA implementations even at high area utilization (operating at 250 MHz with up to 90% of the device area used), and capable of low-latency access to pools of dozens of instances of custom IP cores, automatically generated by high-level synthesis tools.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"10 1","pages":"314-317"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79304951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing performance, productivity and scalability of the TILT overlay processor to OpenCL HLS","authors":"Rafat Rashid, J. Steffan, Vaughn Betz","doi":"10.1109/FPT.2014.7082748","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082748","url":null,"abstract":"High-Level-Synthesis (HLS) tools translate a software description of an application into custom FPGA logic, increasing designer productivity vs. Hardware Description Language (HDL) design flows. Overlays seek to further improve productivity by reducing application compile times and raising abstraction by enabling the designer to target a software-programmable substrate instead of the underlying FPGA. We compare the performance, development effort and scalability of two C-to-FPGA approaches: our TILT overlay processor and Altera's OpenCL HLS. Our application-customized TILT implementations of five data-parallel benchmarks have from 41% to 80% of the throughput per unit of layout area achieved by our best OpenCL HLS designs. The time required for initial hardware compilation of these TILT designs and configuration of the target application onto the overlay is roughly comparable to the compile times of the OpenCL HLS designs: 28 and 103 minutes on average respectively. However, subsequent reconfigurations due to changes in the application that do not require re-synthesis of the overlay are fast, taking 38 seconds on average. In contrast, OpenCL HLS applications require full recompilation after every code change. TILT also enables smaller, more area-efficient designs than OpenCL HLS when low to moderate throughput is sufficient. For high throughput, the larger spatially pipelined designs of OpenCL HLS are preferable.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"50 1","pages":"20-27"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91386502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Trojan detection acceleration based on word-level statistical properties management","authors":"He Li, Qiang Liu","doi":"10.1109/FPT.2014.7082769","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082769","url":null,"abstract":"Hardware Trojan insertion has raised serious concerns to semiconductor industry and government agencies. Hardware Trojan is usually activated under rare conditions associated with low transition bits in a circuit. The damage includes circuit functional failure or important information leakage. Previous research on hardware Trojan detection is mainly based on side-channel analysis and Trojan activation. Long activation time is a major concern during the detection process. In this paper, we propose a novel approach for efficiently accelerating Trojan activation by increasing the transition activity of rare bits. In particular, the proposed approach increases the bit-level transition activity by controlling signal word-level statistical properties, such as changing the variance and autocorrelation of the signal. In addition, by analyzing the signal propagation statistical properties through various digital signal processing (DSP) operators such as adders and multipliers, the proposed approach can control the statistical properties of internal signals and then enhance the internal bit transition activity from the primary input of the circuit. The proposed approach is evaluated on several circuits. The results show that the transition activity of rare bits can be dramatically increased by up to 166.7 times and Trojan activation time can be reduced by up to 121 times.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"96 1","pages":"153-160"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77623290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient FPGA implementation of digit parallel online arithmetic operators","authors":"Kan Shi, D. Boland, G. Constantinides","doi":"10.1109/FPT.2014.7082763","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082763","url":null,"abstract":"Online arithmetic has been widely studied for ASIC implementation. Online components were originally designed to perform computations in digit serial with most significant digit (MSD) first, resulting in the ability to chain arithmetic operators together for low latency. More recently, research has shown that digit parallel online operators can fail more gracefully when operating beyond the deterministic clocking region in comparison to operators with conventional arithmetic. Unfortunately, the utilization of online arithmetic operators in the past has required a large area overhead for FPGA implementation. In this paper, we propose novel approaches to implement the key primitives of online arithmetic, adders and multipliers, efficiently on modern Xilinx FPGAs with 6-input LUTs and carry resources. We demonstrate experimentally that in comparison to a direct RTL synthesis, the proposed architectures achieve slice savings of over 67% and 69%, and speed-ups of over 1.2x and 1.5x for adders and multipliers, respectively. As a result, the area overheads of using online adders and multipliers in place of traditional arithmetic primitives are reduced from 8.41x and 8.11x to 1.88x and 1.84x respectively. Finally, because an online multiplier generates MSDs first, we also demonstrate the method to create an online multiplier with a reduced precision output that is smaller than a traditional multiplier producing the same result. We show that this can lead to silicon area savings of up to 56%.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"10 1","pages":"115-122"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79937125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware architecture of bi-cubic convolution interpolation for real-time image scaling","authors":"Gopinath Mahale, H. Mahale, Rajesh Babu Parimi, S. Nandy, S. Bhattacharya","doi":"10.1109/FPT.2014.7082790","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082790","url":null,"abstract":"This paper presents two hardware architectures of bi-cubic convolution interpolation termed Parallelized Row Column Interpolation Architecture (PRCIA) and Serialized Row Column Interpolation Architecture (SRCIA) for real-time image scaling. These architectures factor in the challenges of high computational complexity, redundant computations and repeated memory accesses, which were otherwise not explicitly addressed in existing architectures. Besides, the proposed architectures also employ parallel computations to improve the throughput for real-time applications. The proposed architectures have been emulated and tested on a Virtex-6 FPGA. The emulated PRCIA and SRCIA are able to scale input grayscale images of dimensions up to 640 × 480 at 59 and 48 frames per second respectively with arbitrary scaling factors up to 4 in both dimensions.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"84 10 1","pages":"264-267"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91137093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel three-dimensional FPGA architecture with high-speed serial communication links","authors":"T. Kajiwara, Qian Zhao, M. Amagasaki, M. Iida, Morituro Kuga, T. Sueyoshi","doi":"10.1109/FPT.2014.7082805","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082805","url":null,"abstract":"Three-dimensional (3D) integrated circuit technology is expected to offer continual improvement to very-large-scale integration performance as the process of miniaturization approaches physical limits. However, because the through-silicon vias (TSVs) that are used to create interlayer vertical connections are much larger in area than transistors, there is an inherent tradeoff between connectivity and small size. Field-programmable gate arrays (FPGAs) are particularly noted for requiring a high level of routing resources, which means that it is unrealistic to make the same number of connections vertically as horizontally. In previous research, we proposed a method for creating a two-layer compact 3D FPGA with face-down integration (the base FPGA). In this paper, we discuss stacking multiple base FPGAs by the face-up method and propose a method for achieving high-speed interlayer communications with TSV serial connections. The proposed architecture improves FPGA performance by using smaller TSVs. The evaluation results show that the proposed 3D FPGA can achieve a total area that is as low as 67% of that of the equivalent two-dimensional FPGA.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"97 1","pages":"306-309"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88782408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A high-performance low-power near-Vt RRAM-based FPGA","authors":"Xifan Tang, P. Gaillardon, G. Micheli","doi":"10.1109/FPT.2014.7082777","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082777","url":null,"abstract":"The routing architecture, heavily using programmable switches, dominates the area, delay and power of Field Programmable Gate Arrays (FPGAs). Resistive Random Access Memories (RRAMs) enable high-performance routing architectures through the replacement of Static Random Access Memory (SRAM)-based programming switches. Exploiting the very low on-resistance state achievable by RRAMs, RRAM-based routing multiplexers can be used to significantly reduce the FPGA routing delays. In addition, RRAM-based routing architectures are less sensitive to supply voltage reductions and show promise in low-power FPGA designs. In this paper, we propose a near-Vt low-power RRAM-based FPGA where both delay and power reductions are achieved. Experimental results demonstrate that a near-Vt RRAM-based FPGA design leads to a 15% area shrink, a 10% delay reduction, and a 65% power improvement, compared to a conventional FPGA design for a given technology node. To achieve low on-resistance values, RRAMs typically require high programming currents. In other words, they need relatively large programming transistors, potentially resulting in area, delay and power inefficiencies. We also present a design methodology to properly size the programming transistors of RRAMs in order to further improve the area-efficiency. Experimental results show that a correct programming transistor sizing strategy contributes to a further 18% area and 2% delay shrink, compared to the initial near-Vt RRAM-based FPGA.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"48 1","pages":"207-214"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82217499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK","authors":"Q. Gautier, A. Shearer, J. Matai, D. Richmond, Pingfan Meng, R. Kastner","doi":"10.1109/FPT.2014.7082810","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082810","url":null,"abstract":"Embedding real-time 3D reconstruction of a scene from a low-cost depth sensor can improve the development of technologies in the domains of augmented reality, mobile robotics, and more. However, current implementations require a computer with a powerful GPU, which limits its prospective applications with low-power requirements. To implement low-power 3D reconstruction we embedded two prominent algorithms of 3D reconstruction (Iterative Closest Point and Volumetric Integration) on an Altera Stratix V FPGA by using the OpenCL language and the Altera OpenCL SDK. In this paper, we present our application and evaluation of the Altera tool in terms of performance, area, and programmability trade-offs. We have verified that OpenCL can be a viable method for developing FPGA applications by modifying an open-source version of the Microsoft KinectFusion project to run partially on an FPGA.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"186 1","pages":"326-329"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77022044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep and narrow binary content-addressable memories using FPGA-based BRAMs","authors":"Ameer Abdelhadi, G. Lemieux","doi":"10.1109/FPT.2014.7082808","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082808","url":null,"abstract":"Binary Content Addressable Memories (BCAMs) are massively parallel search engines capable of searching the entire memory space in a single clock cycle. BCAMs are used in a wide range of applications, such as memory management, networks, data compression, DSP, and databases. Due to the increasing amount of processed information, modern BCAM applications demand a deep searching space. However, traditional BCAM approaches in FPGAs suffer from storage inefficiency. In this paper, a novel and efficient technique for constructing deep and narrow BCAMs out of standard SRAM blocks in FPGAs is proposed. This technique is most efficient for deep and narrow CAMs since the BRAM consumption is exponential in pattern width. Using Altera's Stratix V device, traditional methods achieve up to 64K-entry BCAM while the proposed technique achieves up to 4M entries. For the 64K-entry test-case, traditional methods consume 43 times more ALMs and achieve only one-third of the Fmax. A fully parameterized Verilog implementation is available. This implementation has been extensively tested using Altera's tools.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"7 1","pages":"318-321"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90767201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory security in reconfigurable computers: Combining formal verification with monitoring","authors":"T. Wiersema, Stephanie Drzevitzky, M. Platzner","doi":"10.1109/FPT.2014.7082771","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082771","url":null,"abstract":"Ensuring memory access security is a challenge for reconfigurable systems with multiple cores. Previous work introduced access monitors attached to the memory subsystem to ensure that the cores adhere to pre-defined protocols when accessing memory. In this paper, we combine access monitors with a formal runtime verification technique known as proof-carrying hardware to guarantee memory security. We extend previous work on proof-carrying hardware by covering sequential circuits and demonstrate our approach with a prototype leveraging ReconOS/Zynq with an embedded ZUMA virtual FPGA overlay. Experiments show the feasibility of the approach and the capabilities of the prototype, which constitutes the first realization of proof-carrying hardware on real FPGAs. The area overheads for the virtual FPGA are measured as 2x-10x, depending on the resource type. The delay overhead is substantial with almost 100x, but this is an extremely pessimistic estimate that will be lowered once accurate timing analysis for FPGA overlays becomes available. Finally, reconfiguration time for the virtual FPGA is about one order of magnitude lower than for the native Zynq fabric.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"21 1","pages":"167-174"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75816079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}