{"title":"Binding Hardware IPs to Specific FPGA Device via Inter-twining the PUF Response with the FSM of Sequential Circuits","authors":"Jiliang Zhang, Yaping Lin, Yongqiang Lyu, R. Cheung, Wenjie Che, Qiang Zhou, Jinian Bian","doi":"10.1109/FCCM.2013.12","DOIUrl":"https://doi.org/10.1109/FCCM.2013.12","url":null,"abstract":"The continuous growth in both capability and capacity for FPGA now requires significant resources invested in the hardware design, which results in two classes of main security issues: 1) the unauthorized use and piracy attacks including cloning, reverse engineering, tampering etc. 2) the licensing issue. Binding hardware IPs (HW-IPs) to specific FPGA devices can efficiently resolve these problems. However, previous binding techniques are all based on encryption and hence have three main drawbacks: 1) encryption-based proposals in commercial are limited to protect the single large FPGA configuration, 2) many encryption-based proposals depend on a trusted third party to involve the licensing protocol, and 3) the encryption-based binding methods use costly mechanisms such as secure ROM or flash memory to store FPGA specific cryptographic keys, which is not only expensive but also vulnerable to side-channel attacks, and the management and transport of secret keys became a practical issue. In this work, we propose a PUF-FSM binding technique completely different from the traditional encryption-based methods to address these shortcomings.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129202515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Generation of Gaussian Random Numbers Using the Table-Hadamard Transform","authors":"David B. Thomas","doi":"10.1109/FCCM.2013.53","DOIUrl":"https://doi.org/10.1109/FCCM.2013.53","url":null,"abstract":"Gaussian Random Number Generators (GRNGs) are an important component in parallel Monte-Carlo simulations using FPGAs, where tens or hundreds of high-quality Gaussian samples must be generated per cycle using very few logic resources. This paper describes the Table-Hadamard generator, which is a GRNG designed to generate multiple streams of random numbers in parallel. It uses discrete table distributions to generate pseudo-Gaussian base samples, then a parallel Hadamard transform to efficiently apply the central limit theorem. When generating 64 output samples the Table-Hadamard requires just 100 slices per generated sample, a quarter the resources of the next best technique, while providing higher statistical quality.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125311885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Case for Heterogeneous Technology-Mapping: Soft Versus Hard Multiplexers","authors":"M. Purnaprajna, P. Ienne","doi":"10.1109/FCCM.2013.19","DOIUrl":"https://doi.org/10.1109/FCCM.2013.19","url":null,"abstract":"Lookup table-based FPGAs offer flexibility but compromise on performance, as compared to custom CMOS implementations. This paper explores the idea of minimising this performance gap by using fixed, fine-grained, nonprogrammable logic structures in place of lookup tables (LUTs). Functions previously mapped onto LUTs can now be diverted to these structures, resulting in reduced LUT usage and higher operating speed. This paper presents a generic heterogeneous technology-mapping scheme for segregating LUTs and hard logic blocks. For the proof-of-concept, we choose to isolate multiplexers present in most general-purpose circuits. These multiplexers are mapped onto hard blocks of multiplexers that are present in existing commercial FPGA fabrics, but often unused. Since the hard multiplexers are already present, there is no additional performance or area penalty. Using this approach, an average reduction in LUT usage of 16% and an average speedup of 8% has been observed for the VTR benchmarks as compared to the LUTs-only implementation.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123480586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ShrinkWrap: Compiler-Enabled Optimization and Customization of Soft Memory Interconnects","authors":"Eric S. Chung, Michael Papamichael","doi":"10.1109/FCCM.2013.56","DOIUrl":"https://doi.org/10.1109/FCCM.2013.56","url":null,"abstract":"Today's FPGAs lack dedicated on-chip memory interconnects, requiring users to (1) rely on inefficient, general-purpose solutions, or (2) tediously create an application-specific memory interconnect for each target platform. The CoRAM architecture, which offers a general-purpose abstraction for FPGA memory management, encodes high-level application information that can be exploited to generate customized soft memory interconnects. This paper describes the ShrinkWrap Compiler, which analyzes a CoRAM application for its connectivity and bandwidth requirements, enabling synthesis of highly-tuned area-efficient soft memory interconnects.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129978820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Elementary Function Implementation with Optimized Sub Range Polynomial Evaluation","authors":"M. Langhammer, B. Pasca","doi":"10.1109/FCCM.2013.30","DOIUrl":"https://doi.org/10.1109/FCCM.2013.30","url":null,"abstract":"Efficient elementary function implementations require primitives optimized for modern FPGAs. Fixed-point function generators are one such type of primitives. When built around piecewise polynomial approximations they make use of memory blocks and embedded multipliers, mapping well to contemporary FPGAs. Another type of primitive which can exploit the power series expansions of some elementary functions is floating-point polynomial evaluation. The high costs traditionally associated with floating-point arithmetic made this primitive unattractive for elementary function implementation on FPGAs. In this work we present a novel and efficient way of implementing floating-point polynomial evaluators on a restricted input range. We show on the atan(x) function in double precision that this very different technique reduces memory block count by up to 50% while only slightly increasing DSP count compared to the best implementation built around polynomial approximation fixed-point primitives.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131241193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Soft Coarse-Grained Reconfigurable Array Based High-level Synthesis Methodology: Promoting Design Productivity and Exploring Extreme FPGA Frequency","authors":"Cheng Liu, C. Y. Lin, Hayden Kwok-Hay So","doi":"10.1109/FCCM.2013.21","DOIUrl":"https://doi.org/10.1109/FCCM.2013.21","url":null,"abstract":"Compared to the use of a typical software development flow, the productivity of developing FPGA-based compute applications remains much lower. Although the use of high-level synthesis (HLS) tools may partly alleviate this shortcoming, the lengthy low-level FPGA implementation process remains a major obstacle to high productivity computing, limiting the number of compile-debug-edit cycles per day. Furthermore, high-level application developers often lack the intimate hardware engineering experience that is needed to achieve high performance on FPGAs, therefore undermining their usefulness as accelerators. To address the productivity and performance problems, a HLS methodology that utilizes soft coarse-grained reconfigurable arrays (SCGRAs) as an intermediate compilation step is presented. Instead of compiling high-level applications directly to circuits, the compilation process is reduced to an operation scheduling task targeting the SCGRA.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134300630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Memory Performance of Many-Core FPGA Device through Dynamic Precedence Graph","authors":"Yunru Bai, Abigail Fuentes-Rivera, Mike Riera, Mohammed Alawad, Mingjie Lin","doi":"10.1109/FCCM.2013.39","DOIUrl":"https://doi.org/10.1109/FCCM.2013.39","url":null,"abstract":"Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures [1]. Being “memory-centric”, the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmic-centric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to [1] (A2). To make our comparison fair, in all three architectures, the data path remains the same while local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to the dynamic graph-based memory arbitration. All experiments have been conducted with a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that our proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. Within the performance improvement of A3 over A2, more than 70% improvement comes from the hardware graph-based memory scheduling.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115014902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating the Computation of Induced Dipoles for Molecular Mechanics with Dataflow Engines","authors":"F. Pratas, D. Oriato, O. Pell, R. Mata, L. Sousa","doi":"10.1109/FCCM.2013.34","DOIUrl":"https://doi.org/10.1109/FCCM.2013.34","url":null,"abstract":"In Molecular Mechanics simulations, the treatment of electrostatics is the most computational intensive task. Modern force fields, such as the AMOEBA, which include explicit polarization effects, are particularly computationally demanding. We propose a static dataflow architecture for accelerating polarizable force fields. Results, obtained with Maxeler's MaxCompiler, show a speed-up factor of about 14x on a Maxeler 1U MaxNode, when compared to a 12-core CPU node while using half of the dataflow engine capacity. Projections for a full chip implementation indicate that speed-up results of up to 29x per node can be reached. Moreover, our implementation on the Maxeler system shows improvements between 2.5x and 4x compared to NVIDIA Fermi-based GPUs. The current work shows the potential of dataflow engines in accelerating this field of applications.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127803530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Atacama: An Open FPGA-Based Platform for Mixed-Criticality Communication in Multi-segmented Ethernet Networks","authors":"G. Carvajal, M. Figueroa, R. Trausmuth, S. Fischmeister","doi":"10.1109/FCCM.2013.54","DOIUrl":"https://doi.org/10.1109/FCCM.2013.54","url":null,"abstract":"Ethernet is widely recognized as an attractive networking technology for modern distributed real-time systems. However, standard Ethernet components require specific modifications and hardware support to provide strict latency guarantees necessary for safety-critical applications. Although this is a well-stated fact, the design of hardware components for real-time communication remains mostly unexplored. This becomes evident from the few solutions reporting prototypes and experimental validation, which hinders the consolidation of Ethernet in real-world distributed applications. This paper presents Atacama, the first open-source framework based on reconfigurable hardware for mixed-criticality communication in multi-segmented Ethernet networks. Atacama uses specialized modules for time-triggered communication of real-time data, which seamlessly integrate with a standard infrastructure using regular best-effort traffic. Atacama enables low and highly predictable communication latency on multi-segmented 1Gbps networks, easy optimization of devices for specific application scenarios, and rapid prototyping of new protocol characteristics. Researchers can use the open-source design to verify our results and build upon the framework, which aims to accelerate the development, validation, and adoption of Ethernet-based solutions in real-time applications.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121560578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRML: A Modeling Language for Rapid Design Exploration of Partially Reconfigurable FPGAs","authors":"Rohit Kumar, A. Gordon-Ross","doi":"10.1109/FCCM.2013.24","DOIUrl":"https://doi.org/10.1109/FCCM.2013.24","url":null,"abstract":"Leveraging partial reconfiguration (PR) can improve system flexibility, cost, and performance/power/area tradeoffs over non-PR functionally-equivalent systems, however, realizing these benefits is challenging, time-consuming, and PR must be considered early during application design to reduce design exploration time and improve system quality. To facilitate realizing these benefits, we present an application design framework and an abstract modeling language for PR (PRML). By applying extensive PRML modeling guidelines to a complex arithmetic core, we show PRML's potential for efficient PR capability analysis, enabling designers to determine Pareto optimal systems during application formulation based on designer-specified area and performance metrics.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117081157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}