{"title":"Design and optimization of heterogeneous tree-based FPGA using 3D technology","authors":"V. Pangracious, Z. Marrakchi, H. Mehrez","doi":"10.1109/FPT.2013.6718380","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718380","url":null,"abstract":"The CMOS technology scaling has greatly improved the overall performance and density of Field Programmable Gate Arrays (FPGAs). However, when looking at the performance metrics such as speed, area and power consumption, the gap is generally very wide for FPGAs compared to application specific integrated circuits (ASICs) mainly due to the programmable interconnect overhead. We propose a 3-dimensional (3D) design methodology using horizontal design partitioning to vertically stack heterogeneous FPGA designs based on a Tree-based multilevel FPGA architecture. We describe the 3D design and optimization methodology to improve speed, interconnect area and power consumption using Tezzaron's 3D stacking technology.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133811529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to “Graph Minor Approach for Application Mapping on CGRAs”","authors":"Liang Chen, T. Mitra","doi":"10.1109/FPT.2013.6718431","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718431","url":null,"abstract":"Following the publication of the article “Graph Minor Approach for Application Mapping on CGRAs” [1] in the proceedings of the International Conference on Field Programmable Technology (ICFPT) 2012, we received correspondence [2] pointing to some inaccuracies in the article. With this correction, we would like to clarify some points that could otherwise be misconstrued.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"1378 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123101555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-specific customisation of market data feed arbitration","authors":"S. Denholm, Hiroaki Inoue, Takashi Takenaka, W. Luk","doi":"10.1109/FPT.2013.6718377","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718377","url":null,"abstract":"Messages are transmitted from financial exchanges to update their members about changes in the market. As UDP packets are used for message transmission, members subscribe to two identical message feeds from the exchange to lower the risk of message loss or delay. As financial trades can be time sensitive, low latency arbitration between these market data feeds is of particular importance. Members must either provide generic arbitration for all of their financial applications, increasing latency, or arbitrate within each application which wastes resources and scales poorly. We present a reconfigurable accelerated approach for market feed arbitration operating at the network level. Multiple arbitrators can operate within a single FPGA to output customised feeds to downstream financial applications. Application-specific customisations are supported by each core, allowing different market feed messaging protocols, windowing operations and message buffering parameters. We model multiple-core arbitration and explore the scalability and performance improvements within and between cores. We demonstrate our design within a Xilinx Virtex-6 FPGA using the NASDAQ TotalView-ITCH 4.1 messaging standard. Our implementation operates at 16Gbps throughput, and with resource sharing, supports 12 independent cores, 33% more than simple core replication. A 56ns (7 clock cycles) windowing latency is achieved, 2.6 times lower than a hardware-accelerated CPU approach.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127116781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A high-throughput FPGA architecture for parallel connected components analysis based on label reuse","authors":"M. Klaiber, D. Bailey, Silvia Ahmed, Y. Baroud, S. Simon","doi":"10.1109/FPT.2013.6718372","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718372","url":null,"abstract":"A memory efficient architecture for single-pass connected components analysis suited for high throughput embedded image processing systems is proposed which achieves a high throughput by partitioning the image into several vertical slices processed in parallel. The low latency of the architecture allows reuse of labels associated with the image objects. This reduces the amount of memory by a factor of more than 5 compared to previous work. This is significant, since memory is a critical resource in embedded image processing on FPGAs.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125017785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting stochastic delay variability on FPGAs with adaptive partial rerouting","authors":"Zhenyu Guan, Justin S. J. Wong, S. Chaudhuri, G. Constantinides, P. Cheung","doi":"10.1109/FPT.2013.6718362","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718362","url":null,"abstract":"Aggressive transistor scaling will soon lead us to the physical upper-bound of process technology, where stochastic process variability dominates the timing performance of FPGA components. In this paper, a variation-aware partial-rerouting method is proposed to mitigate and take advantage of the effect of delay variability due to process variation. The variation in logic delay across each FPGA (variation map) is measured on commercial FPGAs and is used to assess the effectiveness and potential gain of the proposed method on current FPGA architectures. Our partial-rerouting method achieved 5.25% improvement in critical path delay under a delay variability of σ/μ = 0.3, and is considerably less time consuming than using variation-aware full chipwise routing, which gave a slightly better timing gain of 6.41% but requires 8x more execution time when optimising for 100 target FPGAs with unique variation maps.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122819393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantum FPGA architecture design","authors":"Jialin Chen, Lingli Wang, Bin Wang","doi":"10.1109/FPT.2013.6718386","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718386","url":null,"abstract":"A Quantum FPGA (QFPGA) architecture is presented for programmable quantum computing, which is a hybrid architecture combining the advantages of the measurement-based quantum computation and the qubus system. QFPGA consists of Quantum Logic Blocks (QLBs) and Quantum Routing Channels (QRCs). The QLB is used to realize a small quantum logic while the QRC is to combine them properly for larger logic realization. There are two types of buses in QFPGA, the local bus in the QLB and the global bus in the QRC, which are to generate the cluster states and general multiqubit rotations around the z axis respectively. However for some applications such as Grover's algorithm and n-qubit quantum Fourier transform, one QLB can be configured for four-qubit phase shift module and four-qubit quantum Fourier transform respectively.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114392806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation of a highly scalable blokus duo solver on FPGA","authors":"Chester Liu","doi":"10.1109/FPT.2013.6718423","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718423","url":null,"abstract":"This paper presents a highly scalable hardware solver for Blokus Duo. Based on flat Monte Carlo method, the proposed solver contains self-contained agents whose number is configurable and only limited by FPGA capacity, which makes the proposed solver highly scalable. Data structures and tile representations are tailored to support efficient memory usage and operations. Implementation result shows that an agent can operate at up to 150MHz while requiring less than 3000 LUTs on the Altera Cyclone II EP2C70F896C6 FPGA device. Simulation result shows the proposed solver can always win level 1 Pentobi.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117073888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"StML: Bridging the gap between FPGA design and HDL circuit description","authors":"Dustin Peterson, O. Bringmann, Thomas Schweizer, W. Rosenstiel","doi":"10.1109/FPT.2013.6718366","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718366","url":null,"abstract":"FPGA circuit implementation is a unidirectional and time-consuming process. Existing approaches like the incremental synthesis try to shorten it, but still need to execute the whole flow for a changed circuit partition. Other approaches circumvent process stages by providing bidirectional mappings between their results. In this paper we propose an approach to provide a bidirectional link between an FPGA design and its HDL code. This link enables the circumvention of the most time-consuming stages (synthesis, mapping, placing, routing) of the FPGA circuit implementation. We implemented our approach in a Java-based EDA tool library, called Static Mapping Library (StML). We demonstrate its applicability by means of hardware debugging and an RTL-based injection of permanent faults, built on top of the StML. Experimental results illustrate that a mapping coverage between 98.5%-100.0% can be obtained, which substantiates the feasibility of this approach. Further experiments illustrate a controllable tradeoff between area overhead, circuit granularity and mapping granularity. With the finest mapping granularity, the area overhead has been between 1.8% and 60.2% for RTL-based circuits. The speedup of the proposed fault injection method has been estimated to be up to 6x for the tested circuits.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129707987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZCluster: A Zynq-based Hadoop cluster","authors":"Zhongduo Lin, P. Chow","doi":"10.1109/FPT.2013.6718411","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718411","url":null,"abstract":"ARM-based servers are garnering increasing interest in big data processing for their low power consumption. However, they are ill-suited for compute-intensive tasks due to their poor processing capability compared to the CPUs used in a traditional server. This paper describes our early efforts to integrate the processing power of the FPGA with the ARM processor inside the Xilinx Zynq SoC. An eight-slave Zynq-based Hadoop cluster is built and a customized hardware accelerator for a standard FIR filter is implemented to demonstrate the effectiveness of hardware acceleration. The Xillybus is used for communication between the ARM processor and the FPGA fabric, achieving a bandwidth of 103MB/s. The Hadoop cluster is proved to be linearly scalable with different input sizes and numbers of slaves. Overall, the cluster achieves a 3.3-fold speedup compared to a native pure software implementation on a single ARM processor and about a 20% improvement compared to an ARM-based cluster without hardware accelerators.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129790838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A connection-based router for FPGAs","authors":"Elias Vansteenkiste, Karel Bruneel, D. Stroobandt","doi":"10.1109/FPT.2013.6718378","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718378","url":null,"abstract":"The FPGA's interconnection network not only requires the larger portion of the total silicon area in comparison to the logic available on the FPGA, it also contributes to the majority of the delay and power consumption. Therefore it is essential that routing algorithms are as efficient as possible. In this work the connection router is introduced. It is capable of partially ripping up and rerouting the routing trees of nets. To achieve this, the main congestion loop rips up and reroutes connections instead of nets, which allows the connection router to converge much faster to a solution. The connection router is compared with the VPR directed search router on the basis of VTR benchmarks on a modern commercial FPGA architecture. It is able to find routing solutions 4.4% faster for a relaxed routing problem and 84.3% faster for hard instances of the routing problem. And given the same amount of time as the VPR directed search, the connection router is able to find routing solutions with 5.8% less tracks per channel.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121354276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}