K. M. Abdellatif, R. Chotin-Avot, Z. Marrakchi, H. Mehrez, Qingshan Tang
{"title":"Towards high performance GHASH for pipelined AES-GCM using FPGAs (abstract only)","authors":"K. M. Abdellatif, R. Chotin-Avot, Z. Marrakchi, H. Mehrez, Qingshan Tang","doi":"10.1145/2554688.2554709","DOIUrl":"https://doi.org/10.1145/2554688.2554709","url":null,"abstract":"AES-GCM has been utilized in various security applications. It consists of two components: an Advanced Encryption Standard (AES) engine and a Galois Hash (GHASH) core. The performance of the system is determined by the GHASH architecture because of the inherent computation feedback. This paper introduces a modification for the pipelined Karatsuba-Ofman Algorithm (KOA)-based GHASH. In particular, the computation feedback is removed by analyzing the complexity of the computation process. The proposed GHASH core is evaluated with three different implementations of AES (BRAM-based SubBytes, composite field-based SubBytes, and LUT-based SubBytes). The presented AES-GCM architectures are implemented using Xilinx Virtex5 FPGAs. Our comparison to previous work reveals that our architectures are more performance-efficient (Thr./Slices).","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128922322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA LUT design for wide-band dynamic voltage and frequency scaled operation (abstract only)","authors":"M. Abusultan, S. Khatri","doi":"10.1145/2554688.2554708","DOIUrl":"https://doi.org/10.1145/2554688.2554708","url":null,"abstract":"Field programmable gate arrays (FPGAs) are the implementation platform of choice when it comes to design flexibility. However, the high power consumption of FPGAs (which arises from their flexible structure) makes them less appealing for extreme low power applications. In this paper, we present a design of an FPGA look-up table (LUT), with the goal of seamless operation over a wide band of supply voltages. The same LUT design can operate at sub-threshold voltage when low power is required, and at higher voltages whenever faster performance is required. The results show that operating the LUT in sub-threshold mode yields ~80x lower power and ~4x lower energy than full supply voltage operation, for a 6-input LUT implemented in a 22nm predictive technology. The key drawback of sub-threshold operation is its susceptibility to process, temperature, and supply voltage (PVT) variations. This paper also presents the design and experimental results for a closed-loop adaptive body biasing mechanism to dynamically cancel these PVT variations. For the same 22nm technology, we demonstrate that the closed-loop adaptive body biasing circuits can allow the FPGA to operate over an operating frequency range that spans an order of magnitude (40 MHz to 1300 MHz). We also show that the closed-loop adaptive body biasing circuits can cancel delay variations due to supply voltage changes, and reduce the effect of process variations on setup and hold times by 1.8x and 2.9x respectively.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116403757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On hybrid memory allocation for FPGA behavioral synthesis (abstract only)","authors":"Qian Zhang, Chenfei Ma, Q. Xu","doi":"10.1145/2554688.2554697","DOIUrl":"https://doi.org/10.1145/2554688.2554697","url":null,"abstract":"FPGA behavioral synthesis has gained significant momentum recently with the growing interest in accelerating high-performance computing applications. While the latest generation of high-level synthesis (HLS) tools has made significant progress, they still lack support for certain high-level language features such as dynamic memory allocation, despite the fact that efficient utilization of the on-chip memory resources in FPGAs is critical to achieving the performance and power consumption targets of many designs. To tackle this problem, we propose a novel hybrid memory allocation scheme to map malloc/free in the C programming language onto FPGA platforms. By estimating the memory usage and available FPGA memory resources, the scheme judiciously allocates static memory blocks and/or instantiates hardware allocators for memory requests. The partition between these two parts is based on estimated access counts and on solving an ILP to minimize overhead from dynamic memory allocation. Experimental results on benchmark circuits demonstrate the efficacy of the proposed technique.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125419844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A soft error vulnerability analysis framework for Xilinx FPGAs","authors":"Aitzan Sari, D. Agiakatsikas, M. Psarakis","doi":"10.1145/2554688.2554767","DOIUrl":"https://doi.org/10.1145/2554688.2554767","url":null,"abstract":"Today's SRAM-based FPGAs provide a rich set of computing resources, which makes them attractive in demanding and critical application domains such as avionics and space. Unfortunately, their high reliance on SRAM configuration memory raises reliability issues due to single-event upsets (SEUs). Considering the criticality of these applications, the vulnerability analysis of FPGA designs to SEUs becomes an essential part of the design flow. In this context, we present an open-source framework for the soft error vulnerability analysis of Xilinx FPGA devices. The proposed framework will allow researchers to evaluate their reliability-aware CAD algorithms and estimate the soft error susceptibility of designs at early stages of the implementation flow for the latest Xilinx architectures.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129151347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application specific processor with high level synthesized instructions (abstract only)","authors":"V. Pus, Pavel Benácek","doi":"10.1145/2554688.2554754","DOIUrl":"https://doi.org/10.1145/2554688.2554754","url":null,"abstract":"The paper deals with the design of an application-specific processor which uses high level synthesized instruction engines. This approach is demonstrated on a high speed network flow measurement processor for FPGA. Our newly proposed concept called Software Defined Monitoring (SDM) relies on advanced monitoring tasks implemented in software and supported by a configurable hardware accelerator. The monitoring tasks reside in the software and can easily control the level of detail retained by the hardware for each flow. This way, the measurement of bulk/uninteresting traffic is offloaded to the hardware, while the interesting traffic is processed in the software. SDM enables the creation of flexible monitoring systems capable of deep packet inspection at high throughput. We introduce the processor architecture and a workflow that allows hardware accelerated measurement modules (instructions) to be created from a description in the C/C++ language. The processor offloads various aggregations and statistics from the main system CPU. The basic type of offload is the NetFlow statistics aggregation. We create and evaluate three more aggregation instructions to demonstrate the flexibility of our system. Compared to the hand-written instructions, the high level synthesized instructions are slightly worse in terms of both FPGA resource consumption and frequency. However, the time needed for development is approximately half.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123463375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Processors and systems","authors":"M. Leeser","doi":"10.1145/3260940","DOIUrl":"https://doi.org/10.1145/3260940","url":null,"abstract":"","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131789643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing FPGA-based energy-efficient dense optical flow computation with high portability in C (abstract only)","authors":"Zhibin Wang, Wenmin Yang, Jin Yu, Zhilei Chai","doi":"10.1145/2554688.2554733","DOIUrl":"https://doi.org/10.1145/2554688.2554733","url":null,"abstract":"Optical flow computation is widely used in many video/image based applications such as motion detection and video compression. A dense optical flow field, which provides more detailed information, is more useful in many applications. However, high-quality algorithms for dense optical flow computation are computationally expensive. For instance, on the ARM Cortex-A9 processor within ZYNQ, the popular linear variational method Combine-Brightness-Gradient (CBG) spends 26.68 s per frame to compute optical flow when the image size is 640 x 480. This is difficult to speed up, especially when embedded systems with power constraints are considered. Poor portability is another factor that limits the use of current optical flow implementations in more applications. In this paper, a high-performance, low-power FPGA-accelerated implementation of dense optical flow computation is presented. One high-quality dense optical flow method, the Combine-Brightness-Gradient model, is implemented. C code instead of VHDL/Verilog HDL is used to improve productivity. Portability of the system is designed carefully so that it can be deployed conveniently on different platforms. Experimental results show that 12 fps and 0.38 J per frame are achieved by this optical flow computing system when a 640 x 480 image is used and optical flow is computed for all pixels. Furthermore, portability is demonstrated by implementing the optical flow algorithm on different heterogeneous platforms such as the ZYNQ-7000 SoC and the PC-FPGA platform with a Kintex-7 FPGA respectively.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130140292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asynchronous physical unclonable function using FPGA-based self-timed ring oscillator (abstract only)","authors":"R. Silwal, M. Niamat","doi":"10.1145/2554688.2554745","DOIUrl":"https://doi.org/10.1145/2554688.2554745","url":null,"abstract":"Recently, electronic industries have been facing an increased amount of hardware counterfeits. These counterfeit components, when assembled into a product or a system, can not only jeopardize performance and reliability but also create safety issues. A Physical Unclonable Function (PUF) provides a means to enhance the physical security of Integrated Circuits (ICs) against piracy and unauthorized access. The proposed design illustrates the feasibility of using self-timed ring oscillators as a novel approach towards PUF implementation for FPGA authentication. The proposed Self-Timed Ring Oscillator PUF (STRO-PUF) consists of two groups of identically laid-out self-timed ring oscillators. Inputs to the PUF are given through a challenge generator, which selects two self-timed ring oscillators from each group. Outputs of the oscillators are fed to the multiplexers of the corresponding groups. Self-timed ring oscillators exploit the inherent features of random process variations by producing varying frequencies. These unpredictable variations in frequencies are captured using a frequency comparator, which generates an output bit. A unique set of output bits, or response, is generated for each set of input bits, or challenge. This unique Challenge Response Pair (CRP) is used in identifying a particular device. Frequencies generated from these oscillators are read through a logic analyzer. The varying frequencies observed from all the oscillators mapped across different regions of FPGAs range from 16.234 MHz to 125 MHz with an average frequency of 101.446 MHz. Experimental results show that the uniqueness of the PUF response is 49.92%, which is very close to the desired 50%.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130242236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Hutchings, Joshua S. Monson, D. Savory, J. Keeley
{"title":"A power side-channel-based digital to analog converter for Xilinx FPGAs","authors":"B. Hutchings, Joshua S. Monson, D. Savory, J. Keeley","doi":"10.1145/2554688.2554770","DOIUrl":"https://doi.org/10.1145/2554688.2554770","url":null,"abstract":"A novel Digital to Analog Converter (DAC) modulates the overall power consumption of an FPGA by disabling/enabling short circuits programmed into the interconnect. The power pin of the FPGA serves as the output of the DAC. The DAC achieves high linearity and can be used to implement applications in communications, security, etc. The short-circuit-based DAC consumes 1/3 the area of an alternative shift-register-based DAC that is presented for the sake of comparison.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125770085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A power-efficient adaptive heapsort for fpga-based image coding application (abstract only)","authors":"Yuhui Bai, S. Z. Ahmed, B. Granado","doi":"10.1145/2554688.2554746","DOIUrl":"https://doi.org/10.1145/2554688.2554746","url":null,"abstract":"This paper presents an adaptive heap sort architecture for an image coding implementation on FPGA, which specifically addresses the issue of sorting the different amounts of data located in each subband during coding. The proposed sorting architecture is easily scalable. Performance of the sorter depends only on the amount of data sorted. The efficient usage of dual port memories yields high throughput of up to 50 Msamples/s, and their adaptive trigger/shutdown provides an average dynamic power reduction of up to 20.9%. We designed this architecture and incorporated it in our Adaptive Scanning of Wavelet Data (ASWD) module, which reorganizes the wavelet coefficients into locally stationary sequences for a wavelet-based image encoder. We validated the hardware on an Altera Stratix IV FPGA as an IP accelerator in a Nios II processor based System on Chip. The architectural innovations can also be exploited in other applications that require high throughput and scalable sorting. Our experiments show that compared to an embedded ARM Cortex-A9 processor running at 666 MHz, our architecture at 100 MHz can provide around 13X speedup while consuming 242 mW average core dynamic power.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"509 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132479074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}