{"title":"FPGA Delay Model Considering Logic-Level and Transistor-Level Parameters","authors":"Qiang Liu, HanJing Qian","doi":"10.1109/FCCM.2017.16","DOIUrl":"https://doi.org/10.1109/FCCM.2017.16","url":null,"abstract":"Field programmable gate arrays (FPGAs) have been adopted in various fields, due to their design flexibility and customizability. Different applications have different requirements in performance, hardware resources and cost, leading to demands for diverse FPGA architectures. Delay is an important metric to evaluate different alternatives during FPGA architecture development. The existing analytical delay models for FPGAs mainly consider the logical architecture parameters. However, the variations of transistor-level parameters, Vdd and Vt, also have a great influence on delay under the development trend of low-power design and deep sub-micron technology. To explore various design options at the early design stage and provide transistor-level accuracy, an FPGA delay model that considers Vdd and Vt is necessary. In this paper, an analytical model containing structural parameters of logic blocks and routing blocks, as well as Vdd and Vt, is built to estimate the FPGA critical path delay.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130556926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Theodoropoulos, Nikolaos S. Alachiotis, D. Pnevmatikatos
{"title":"Multi-FPGA Evaluation Platform for Disaggregated Computing","authors":"D. Theodoropoulos, Nikolaos S. Alachiotis, D. Pnevmatikatos","doi":"10.1109/FCCM.2017.20","DOIUrl":"https://doi.org/10.1109/FCCM.2017.20","url":null,"abstract":"We present a versatile FPGA-based evaluation platform for exploring alternative execution strategies on disaggregated environments for applications, considering different processing block types: compute cores, memory, and accelerators. Developers can interconnect different block types in order to create optimal configurations. A user-level software library allows quick mapping of applications on real hardware. We have implemented a fully working prototype using three ZC706 FPGA boards, and evaluated different software / hardware configurations of a matrix multiplication benchmark.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130642558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
John Mawer, Oscar Palomar, Cosmin Gorgovan, A. Nisbet, W. Toms, M. Luján
{"title":"The Potential of Dynamic Binary Modification and CPU-FPGA SoCs for Simulation","authors":"John Mawer, Oscar Palomar, Cosmin Gorgovan, A. Nisbet, W. Toms, M. Luján","doi":"10.1109/FCCM.2017.36","DOIUrl":"https://doi.org/10.1109/FCCM.2017.36","url":null,"abstract":"In this paper we describe a flexible infrastructure that can directly interface unmodified application executables with FPGA hardware acceleration IP in order to (1) facilitate faster computer architecture simulation and (2) prototype microarchitecture or accelerator IP. Dynamic binary modification tool plugins are directly interfaced to the application under evaluation via flexible software interfaces provided by a userspace hardware control library that also manages access to a parameterised Bluespec IP library. We demonstrate the potential of our infrastructure with two use cases with unmodified application executables where (1) an executable is dynamically instrumented to generate load/store and program counter events that are sent to FPGA hardware accelerated in-order microarchitecture pipeline and memory hierarchy models, and (2) the design of a branch predictor is prototyped using an FPGA. The key features of our infrastructure are the ability to instrument at instruction level granularity, to code exclusively at the user level, and to dynamically discover and use available hardware models at run time. Thus, we enable software developers to rapidly investigate and evaluate parameterised Bluespec microarchitecture and accelerator IP models. We present a comparison between our system and GEM5, the industry standard ARM architecture simulator, to demonstrate accuracy and relative performance. Even though our system is implemented on a Xilinx Zynq 7000 FPGA board with tightly coupled FPGA and ARM Cortex A9 processors, it outperforms GEM5 running on a Xeon with 32 GB of RAM (400x vs 700x slowdown over native execution).","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132778291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Network-on-Chip Based H.264 Video Decoder Prototype Implemented on FPGAs","authors":"Ian J. Barge, Cristinel Ababei","doi":"10.1109/FCCM.2017.10","DOIUrl":"https://doi.org/10.1109/FCCM.2017.10","url":null,"abstract":"We present a field programmable gate array (FPGA) based implementation of the H.264 video decoder algorithm. The novelty of our design is that the communication between the decoder modules is done using a network-on-chip (NoC). This makes our design scalable and easily integrated within larger future NoC based systems, where the same hardware platform can host other algorithms such as compression, filtering, etc. Our primary objective is to study the achievable performance with a NoC based H.264 decoder solution. The design process involves three main steps. First, the H.264 algorithm is split into eight different partitions, which are implemented as individual processing elements (PEs). These processing elements are attached to the routers of the regular mesh NoC and include: network abstraction layer (NAL) parser and entropy decoder, frame buffer and integer motion, inverse quantization inverse transform, intra prediction, luma sub-pixel motion, chroma sub-pixel motion, deblocking filter, and display driver. These PEs are described in VHDL with the first two being executed on Nios II softcores. The network-on-chip was generated with the Connect tool from Carnegie Mellon University and integrated within the top level design entity. Second, we specify the location of each of the PEs inside the regular mesh NoC. Because we use eight PEs, the NoC architecture needs to be a 3x3 regular mesh topology. When we specify the location of the PEs inside the mesh topology (i.e., specify the router to which a particular PE is attached), we effectively solve what is called the NoC mapping problem. To do that, we use manual mapping, which is done intelligently based on information about the internal structure of the decoding algorithm. This helps to reduce the number of routers that packets must traverse in the network. Finally, the entire project is synthesized, placed, and routed with the Quartus Prime Standard Edition 16.1 tool. The final design is tested and verified on the DE4 development board, which uses Altera's Stratix IV GX FPGA chip. At the time of submission, the implementation decodes 100 frames in 33 seconds for a frame size of 192x144 pixels and 100 frames in 56 seconds for a resolution of 320x240 pixels per frame. Documentation and source code of the entire project will be released to the public domain. We hope that this will enable other researchers to easily replicate and compare results to ours and that it will encourage and facilitate further research in the areas of image processing, computer vision, and advanced VHDL design and FPGAs.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132166418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emmanouil Kousanakis, A. Dollas, E. Sotiriades, I. Papaefstathiou, D. Pnevmatikatos, Athanasia Papoutsi, P. Petrantonakis, Panayiota Poirazi, Spyridon Chavlis, George Kastellakis
{"title":"An Architecture for the Acceleration of a Hybrid Leaky Integrate and Fire SNN on the Convey HC-2ex FPGA-Based Processor","authors":"Emmanouil Kousanakis, A. Dollas, E. Sotiriades, I. Papaefstathiou, D. Pnevmatikatos, Athanasia Papoutsi, P. Petrantonakis, Panayiota Poirazi, Spyridon Chavlis, George Kastellakis","doi":"10.1109/FCCM.2017.51","DOIUrl":"https://doi.org/10.1109/FCCM.2017.51","url":null,"abstract":"Neuromorphic computing is expanding by leaps and bounds through custom integrated circuits (digital and analog), and large scale platforms developed by industry or government-funded projects (e.g. TrueNorth and BrainScaleS, respectively). Whereas the trend is for massive parallelism and neuromorphic computation in order to solve problems, such as those that may appear in machine learning and deep learning algorithms, there is substantial work on brain-like highly accurate neuromorphic computing in order to model the human brain. In such a form of computing, spiking neural networks (SNN) such as the Hodgkin and Huxley model are mapped to various technologies, including FPGAs. In this work, we present a highly efficient FPGA-based architecture for the detailed hybrid Leaky Integrate and Fire SNN that can simulate generic characteristics of neurons of the cerebral cortex. This architecture supports arbitrary, sparse O(n^2) interconnection of neurons without the need to re-compile the design, and plasticity rules, yielding on a four-FPGA Convey 2ex hybrid computer a speedup of 923x for a non-trivial data set on 240 neurons vs. the same model in the software simulator BRAIN on an Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, i.e. the reference state-of-the-art software. Although the reference, official software is single core, the speedup demonstrates that the application scales well among multiple FPGAs, whereas this would not be the case in general-purpose computers due to the arbitrary interconnect requirements. The FPGA-based approach leads to highly detailed models of parts of the human brain up to a few hundred neurons vs. a dozen or fewer neurons on the reference system.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"145 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134033379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Bit-Serial NoCs for FPGAs","authors":"Nachiket Kapre","doi":"10.1109/FCCM.2017.14","DOIUrl":"https://doi.org/10.1109/FCCM.2017.14","url":null,"abstract":"We can build lightweight bit-serial FPGA NoC routers that cost 20 LUT, 17 FF per router and operate at 800–900 MHz speeds. Each bit-serial router implements deflection-routing on a unidirectional torus topology requiring 1b-wide connection per port. The key ideas that enable this implementation are (1) reformulation of the dimension-ordered routing (DOR) function using compact 1 LUT, 1 FF streaming pattern matchers, (2) compact retiming of the datapath signals into SRL16 blocks, and (3) careful FPGA layout to efficiently pack the router logic into small rectangular regions 2×4 SLICEs on the chip. We anticipate these bit-serial NoCs can be used in a variety of scenarios including overlay support for triggered debug, lightweight control signal dissemination, massively-parallel bit-serial processing.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"17 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133238767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Case for Common-Case: On FPGA Acceleration of Erasure Coding","authors":"R. Nakhjavani, Jianwen Zhu","doi":"10.1109/FCCM.2017.42","DOIUrl":"https://doi.org/10.1109/FCCM.2017.42","url":null,"abstract":"Reliable storage is a central component of data centers that support private or public clouds. Erasure coding has become an increasingly popular alternative to replication for its capability to substantially cut disk cost while delivering the same reliability. This paper reports comprehensive results of using FPGAs to accelerate erasure encoding and decoding algorithms. In particular, to accomplish the best efficiency in throughput delivered per thousand LUTs, we argue it is best to allocate more resources to the common-case, which we show can be more than 90%, while reducing the performance target for the general-case. With further innovations, we show, as an example, that for a RS(10,4) erasure code, and a 1.3% disk failure probability, 6Gb/s/KLUT can be accomplished for 5 nines of reliability. In terms of power efficiency, our design is able to achieve 40Gb/s/Watt.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121673276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAPSL: A Tool for Automatic Generation of Hardware Sandboxes for IP Security","authors":"Taylor J. L. Whitaker, C. Bobda","doi":"10.1109/FCCM.2017.54","DOIUrl":"https://doi.org/10.1109/FCCM.2017.54","url":null,"abstract":"We propose a design flow for automatic generation of hardware sandboxes. Our tool, the Component Authentication Process for Sandboxed Layouts (CAPSL), generates sandboxes capable of detecting trojan activation and nullifying potential damage to a system at run-time. Our approach captures the behavioral properties of non-trusted IPs with formal models that are translated to checker automata and implemented within an untrusted partition of the system to isolate sandbox-system interactions upon deviation from the behavioral checkers.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115763369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Naveen Kumar Dumpala, S. B. Patil, Daniel E. Holcomb, R. Tessier
{"title":"Energy Efficient Loop Unrolling for Low-Cost FPGAs","authors":"Naveen Kumar Dumpala, S. B. Patil, Daniel E. Holcomb, R. Tessier","doi":"10.1109/FCCM.2017.22","DOIUrl":"https://doi.org/10.1109/FCCM.2017.22","url":null,"abstract":"Many FPGA computations, including block ciphers, require repetitive loop operations that are difficult to parallelize. Sequential loop implementation leads to significant clock power while loop unrolling can lead to significant glitch power. In this paper, we provide a low overhead approach to unroll block ciphers and other loops in low-cost FPGAs to reduce energy consumption. A latch-based glitch filter is introduced for unrolled loops that reduces loop energy per operation by over an order of magnitude. Our filters and associated control for unrolled loops can be automatically instantiated as a macro for FPGA designs, allowing for easy designer use. We demonstrate our approach for SIMON-128 and AES-256 block ciphers implemented on a Xilinx Artix-7 FPGA.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116736032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Hosseini, Rashidul Islam, A. Kulkarni, T. Mohsenin
{"title":"A Scalable FPGA-Based Accelerator for High-Throughput MCMC Algorithms","authors":"M. Hosseini, Rashidul Islam, A. Kulkarni, T. Mohsenin","doi":"10.1109/FCCM.2017.56","DOIUrl":"https://doi.org/10.1109/FCCM.2017.56","url":null,"abstract":"Markov Chain Monte Carlo (MCMC) algorithms are used to obtain samples from any target probability distribution and are widely used in stochastic processing techniques. Stochastic processing techniques such as machine learning and image processing need to compute large amounts of data in real-time, thus high throughput MCMC samplers are of utmost importance. Parallel Tempering (PT) MCMC has shown better mixing and convergence for high-dimensional and multi-modal distributions compared to other popular MCMC algorithms. In this paper, we employ a special case of Dth order Markov chains to modify the PT-MCMC algorithm, named \"Multiple Parallel Tempering\" (MPT). The modification converts one MCMC sampler into multiple independent samplers that generate and interleave their samples on one output line each clock cycle. A fully scalable and pipelined hardware accelerator for the PT and proposed MPT samplers is designed and implemented on a Xilinx Artix-7 FPGA for chain numbers of 1, 2, and 8. The post-place and route FPGA implementation results indicate that the throughput of the proposed MPT sampler for chain numbers 1, 2, and 8 is 31x, 31x, and 28x higher, respectively, than that of the PT sampler with the same chain number configuration.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133719398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}