{"title":"Hardware acceleration of biomedical models with OpenCMISS and CellML","authors":"Ting-Rong Yu, C. Bradley, O. Sinnen","doi":"10.1109/FPT.2013.6718390","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718390","url":null,"abstract":"OpenCMISS is a mathematical modeling environment designed to solve field based equations and link subcellular and tissue-level biophysical processes to organ-level processes. It employs a general purpose parallel design, in particular distributed memory, for its computations. CellML is a mark up language based on XML that is designed to encode lumped parameter biophysically based systems of ordinary differential equations and nonlinear algebraic equations. OpenCMISS allows CellML models to be evaluated and integrated into models at various spatial and temporal scales. With good inherent parallelism, hardware acceleration based on FPGAs has a great potential to increase the computational performance and to reduce the energy consumption of computations with CellML models integrated in OpenCMISS. However, with several hundred CellML models, manual hardware implementation for each CellML model is complex and time consuming. The advantages of FPGA designs will only be realised if there is a general solution or a tool to automatically convert CellML models into hardware description languages such as VHDL. In this paper we describe the architecture for the FPGA hardware implementation of CellML models and evaluate the first results related to performance and resource usage based on a variety of criteria.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122051950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Derivation of efficient FSM from loop nests","authors":"Tomofumi Yuki, Antoine Morvan, Steven Derrien","doi":"10.1109/FPT.2013.6718367","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718367","url":null,"abstract":"Pipelined execution is one of the most important optimizations in hardware design to improve hardware utilization rate, and hence the throughput. Loop pipelining is a transformation available in High Level Synthesis tools to execute multiple iterations of a loop in a pipeline. Nested loop pipelining is a related technique that improves hardware utilization rate when the iteration count of the innermost loop is small. However, it is also known to increase the complexity of the control, and hence degrading frequency. In this paper, we present an automatic transformation targeting HLS that improves the effectiveness of nested loop pipelining, by efficient implementations of the control-path. Specifically, we present (i) an analytical model that captures the trade-off between gain in cycles and loss in frequency, (ii), automatic derivation of efficient Finite State Machine from loop nests, and (iii) an efficient implementation of the derived FSM that improves the performance of synthesized hardware.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127694836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible hierarchy ray tracing on FPGAs","authors":"Sam Collinson, O. Sinnen","doi":"10.1109/FPT.2013.6718379","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718379","url":null,"abstract":"Rendering programs use ray tracing to artificially create photo-realistic scenes that would otherwise be too dangerous, too costly or physically impossible to fabricate. Acceleration of the rendering process can be achieved through spatial or object hierarchy structures, which aim to restrict the number of expensive ray-object intersection calculations along a ray path by trading them for traversal of the structure. With extensive inherent parallelism, ray tracing benefits from GPU acceleration but may also benefit from the more flexible control flow and memory architecture available with FPGAs. We present a flexible FPGA based ray tracing platform capable of traversing varying widths and types of acceleration hierarchies to evaluate their efficiency. The platform consists of four main controllers for communication, traversal, intersection and memory. The platform interfaces with LuxRays, an open-source C++ renderer, over PCIexpress to transfer data for computation to onboard memory. We implement a configuration of the platform at 250MHz on our target device that shows promising results compared to CPU and GPU renders.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128131628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching FPGA security","authors":"L. Bossuet","doi":"10.1109/FPT.2013.6718373","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718373","url":null,"abstract":"Teaching FPGA security to electrical engineering students is new at graduate level. It requires a wide field of knowledge and a lot of time. This paper describes a compact course on FPGA security that is available to electrical engineering master's students at the Saint-Etienne Institute of Telecom, University of Lyon, France. It is intended for instructors who wish to design a new course on this topic. The paper reviews the motivation for the course, the pedagogical issues involved, the curriculum, the lab materials and tools used, and the results. Details are provided on two original lab sessions, in particular, a compact lab that requires students to perform differential power analysis of FPGA implementation of the AES symmetric cipher.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124310143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 66.1 Gbps single-pipeline AES on FPGA","authors":"Qiang Liu, Zhenyu Xu, Ye Yuan","doi":"10.1109/FPT.2013.6718392","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718392","url":null,"abstract":"Targeting real-time encryption/decryption of high speed data communication, this paper proposes an FPGA-based high throughput AES design. The critical functions involved in AES are broken into elementary logic operations to gain the deep insight into the performance bottleneck. With respect to FPGA structures, a datapath with two balanced pipeline stages is determined for each of the encryption/decryption rounds. Meanwhile, a new key expansion scheme with additional nonlinear operations is proposed to increase the security of the AES implementation and is well matched to the two-stage pipelining datapath. The design is evaluated on various FPGA devices and is compared with several existing AES implementations. Results show that in terms of both throughput and throughput per slice the proposed AES design with single pipeline can overcome most existing designs and achieves a throughput of 66.1 Gbps on a latest FPGA device.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124423934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-personality partitioning for heterogeneous systems","authors":"A. Gregerson, Aman Chadha, Katherine Morrow","doi":"10.1109/FPT.2013.6718375","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718375","url":null,"abstract":"Design flows use graph partitioning both as a precursor to place and route for single devices, and to divide netlists or task graphs among multiple devices. Partitioners have accommodated FPGA heterogeneity via multi-resource constraints, but have not yet exploited the corresponding ability to implement some computations in multiple ways (e.g., LUTs vs. DSP blocks), which could enable a superior solution. This paper introduces multi-personality graph partitioning, which incorporates aspects of resource mapping into partitioning. We present a modified multi-level KLFM partitioning algorithm that also performs heterogeneous resource mapping for nodes with multiple potential implementations (multiple personalities). We evaluate several variants of our multi-personality FPGA circuit partitioner using 21 circuits and benchmark graphs, and show that dynamic resource mapping improves cut size on average by 27% over static mapping for these circuits. We further show that it improves deviation from target resource utilizations by 50% over post-partitioning resource mapping.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126421974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The architecture and placement algorithm for a uni-directional routing based 3D FPGA","authors":"Junsong Hou, Heng Yu, Yajun Ha, Xin Liu","doi":"10.1109/FPT.2013.6718325","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718325","url":null,"abstract":"Three-Dimensional (3D) FPGA as a promising design trend, achieves significant performance improvement over conventional 2D-based FPGA. The maturity of the uni-directional routing architecture design, which achieves 25% area saving in area-delay-product (ADP) over bi-directional routing architectures, has driven major vendors such as Xilinx and Altera to switch to such architecture in their 2D-based products. However, few studies were contributed to exploring performance-optimal uni-directional 3D routing architectures. In this paper, we propose and evaluate a novel uni-directional 3D routing architecture named UNI-3D. Additionally, in the EDA counterpart, we also propose an improved simulated annealing (SA)-based placement algorithm that caters the unidirectional architecture, to alleviate signal propagation imbalance in the vertical channels resulted from using conventional bi-directional based SA approach. Our simulation results show that our proposed architecture is able to achieve up to 28.44% of delay reduction and 26.21% planar channel width reduction compared with the baseline 2D uni-directional architecture. At the same time, the proposed SA algorithm is able to improve the average vertical channel width up to 16% compared to state-of-the-art works.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127409457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient methods for out-of-order load/store execution for high-performance soft processors","authors":"Henry Wong, Vaughn Betz, Jonathan Rose","doi":"10.1109/FPT.2013.6718409","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718409","url":null,"abstract":"As FPGAs continue to increase in size, it becomes increasingly feasible and desirable to build higher performance soft processors. Preserving the familiar single-threaded programming model can be done with an out of order processor. The ability to execute memory loads and stores out of order has a large impact on performance, but this is difficult to do because the dependencies between stores and loads are not known until addresses are computed. Out of order memory disambiguation is traditionally done with CAMs in the load queue and store queue, but large CAMs are inefficient on FPGAs. Store Queue Index Prediction (SQIP) and NoSQ propose to replace CAMs with store-load forwarding prediction and load re-execution. We implement four memory disambiguation schemes (in-order, CAM, SQIP, NoSQ) on a Stratix IV FPGA and evaluate the area and delay trade-offs. We find that CAM area and delay degrade quickly with load/store queue size, while SQIP and NoSQ have little degradation with queue size but have area overhead for prediction and predictor training hardware. SQIP and NoSQ use less area than CAMs beyond 32 and 16 load/store queue entries, respectively, and have higher maximum frequency beyond 4 entries.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115270642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Datapath fault tolerance for parallel accelerators","authors":"James J. Davis, P. Cheung","doi":"10.1109/FPT.2013.6718389","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718389","url":null,"abstract":"While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with `bolt-on' logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest-implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123852075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}