{"title":"A high speed design and implementation of dynamically reconfigurable processor using 28NM SOI technology","authors":"Toru Katagiri, H. Amano","doi":"10.1109/FPL.2014.6927438","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927438","url":null,"abstract":"Although dynamically reconfigurable processor arrays (DRPAs) are advantageous for embedded devices because of their high energy efficiency, many of the recent mobile devices are required to execute increasingly performance-centric jobs. One fairly straingtfoward way of increasing the clock frequency is introducing a pipelined structure into each PE. However, this results in frequent pipeline stalls due to the data hazard between multiple PEs. In order to mitigate the effect of data hazard between PEs, we propose a tiny vector instruction mechanism. With a single vector instruction, a small amount of data is continuously processed in the pipeline of the PE. Pipeline stalls are removed without increasing the number of hardware contexts, and thus the amount of configuration data. Evaluation results based on the implementation using 28nm SOI process technology, a DRPA with tiny vector instructions (DRPA-TVI) improves the performance by 2.4 three times compared to a base DRPA with just a small increase of area and power consumption.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126445142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhenghong Jiang, C. Y. Lin, Liqun Yang, Fei Wang, Haigang Yang
{"title":"Exploring architecture parameters for dual-output LUT based FPGAs","authors":"Zhenghong Jiang, C. Y. Lin, Liqun Yang, Fei Wang, Haigang Yang","doi":"10.1109/FPL.2014.6927470","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927470","url":null,"abstract":"Dual-output lookup tables (LUTs) are mainstream in the design of commercial FPGA products. A detailed exploration of architectural parameters of FPGAs based on dualoutput LUTs is presented. Different from traditional single-output LUT based architecture, “shared inputs” between the sub-LUTs is a new parameter specific to dual-output architecture. In this paper, we focus on the effect of ratio of shared inputs on the performance and area-efficiency. First, we study the required cluster inputs and derive a relationship between cluster inputs, LUT size and cluster size under different ratios of shared inputs. Secondly, our evaluation results show that a FPGA with 4-LUTs and a shared input ratio of two thirds is preferred for area-efficiency, while a large LUT size of 9 with no shared inputs achieves best performance. Finally, we determine that a LUT size of 4, a cluster size from 3 to 8, and a shared input ratio between 1/3 and 2/3, provide the best area-delay product for dual-output LUT based FPGAs.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116583184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rui Jia, C. Y. Lin, Zhenhong Guo, R. Chen, Fei Wang, Tongqiang Gao, Haigang Yang
{"title":"A survey of open source processors for FPGAs","authors":"Rui Jia, C. Y. Lin, Zhenhong Guo, R. Chen, Fei Wang, Tongqiang Gao, Haigang Yang","doi":"10.1109/FPL.2014.6927482","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927482","url":null,"abstract":"FPGA-based SoCs have been gaining great favors among traditional applications and expanding the application domains to some new areas. Reusing and sharing components is an attractive and pratical methodology for system designers to reduce the design complexity of the SoC architectures. Open source hardware has become an effective method of improving the design productivity. As the core functional component, processors affects the performance of SoC systems. This paper investigates existing open source processors, and gives an overview. From the points of usability and stability, the main features of open source processors are summarized. Following these features, some open source processors with high usability and stability are selected. These open source processors and existing vendor-provided soft processors are implemented on Stratix V and Virtex-7 FPGAs using corresponding EDA tools, Quart us n and ISE. The implementation results are compared and discussed.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130596637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Biomedical image processing and reconstruction with dataflow computing on FPGAs","authors":"F. Grüll, U. Kebschull","doi":"10.1109/FPL.2014.6927378","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927378","url":null,"abstract":"Increasing chip sizes and better programming tools have made it possible to increase the boundaries of application acceleration with FPGAs. Two applications, localization microscopy and electron tomography, are presented in the author's PhD thesis and summarized in this paper. Both have been ported from imperative languages to the dataflow paradigm that maps well onto long processing pipelines in custom hardware. The results show that an acceleration of 200 compared to an Intel i5 450 CPU for localization microscopy, and an acceleration of 5 over an Nvidia Tesla C1060 for electron tomography while maintaining full accuracy. The main challenge arose from the need to fully understand and re-write most of the imperative source in a form suitable for dataflow computing.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130704907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Heartbeats: A framework for dynamic management of Autonomous SoCs","authors":"Shane T. Fleming, David B. Thomas","doi":"10.1109/FPL.2014.6927489","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927489","url":null,"abstract":"Modern computer systems are formed from many interacting systems and heterogeneous components, that face increasing constraints on performance, power consumption, and temperature. Such systems have complex run-time dynamics which cannot easily be predicted or modelled at design-time, creating a need for online dynamic systems management. The Heartbeats API is a popular open source project which provides a standardised way for applications to monitor and publish their progress in multi-core CPU systems, but it does not allow hardware components to be monitored or to observe the progress of other components of the system. This paper presents work which extends the capacities of the Heartbeats API across the whole system while maintaining backwards compatibility with the legacy software API. To demonstrate the framework's capabilities an Autonomous Underwater Vehicle (AUV) case study is explored, where a power-aware HW/SW image processing application is implemented on a reconfigurable SoC and an approximate energy saving of 30% is observed for an example input video. Current progress is also discussed on some applications which build upon the framework, including an CubeSat experiment for an Adaptive Heterogeneous FDIR system that will launch in 2016 by the European Space Agency.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131099729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Secure partial dynamic reconfiguration with unsecured external memory","authors":"H. Kashyap, R. Chaves","doi":"10.1109/FPL.2014.6927477","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927477","url":null,"abstract":"This paper proposes a solution to improve the security of the partial dynamic reconfiguration of FPGA, without significantly affecting the reconfiguration performance. The existing solutions for secure partial dynamic reconfiguration on SRAM based FPGAs impact the reconfiguration process and the available resources due to their complex multi-layered partial bitstream validation process. This adversely affects the performance of applications using reconfigurable hardware. The proposed solution uses high performance encryption engines to change the encryption key of the remotely received bitstream by a randomly generated key, unique to each configuration, when storing the bitstream in the external unsecured memory. An additional CBC-MAC authentication mechanism is also considered that combined with the frame-wise error detection mechanism of the configuration port, allows for an improved countermeasure against replay attack and wrongful bitstream usage. The proposed solution introduces a resource overhead of 1.1% in regard to the base reconfigurable system and provides the lowest impact on the reconfiguration process when compared to the related state of the art, achieving a reconfiguration throughput of 2.5 Gbps.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131432381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using buffer-to-BRAM mapping approaches to trade-off throughput vs. memory use","authors":"Jasmina Vasiljevic, P. Chow","doi":"10.1109/FPL.2014.6927469","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927469","url":null,"abstract":"One of the challenges in designing high-performance FPGA applications is fine-tuning the use of limited on-chip memory storage among many buffers in an application. To achieve desired performance and meet the on-chip memory budget requirements, the designer faces the burden of manually assigning application buffers to physical on-chip memories. Mismatches between dimensions (bit-width and depth) of buffers and physical on-chip memories lead to underutilized memories. Memory utilization can be increased via buffer packing - grouping buffers together and implementing them as a single memory, at the expense of data throughput. However, identifying buffer groups that result in the least amount of physical memory is a combinatorial problem with a large search space. This process is time consuming and non-trivial, particularly with a large number of buffers of various depths and bit widths. Previous work [1] introduced a tool that provides high-level pragmas allowing the user to specify global memory requirements, such as an application's on-chip memory budget and data throughput. This paper extends the previous work by introducing two low-level pragmas that specify information about memory access patterns, resulting in an improved on-chip memory utilization up to 22%. Further, we develop a simulated annealing based buffer packing algorithm, which reduces the tool's run-time from over 30 mins down to 15 sec, with an improvement in performance in the generated memory solution. Finally, we demonstrate the effectiveness of our tool with four stream application benchmarks.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134452667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Fukuda, Hiroaki Inoue, Takashi Takenaka, Dahoo Kim, Tsunaki Sadahisa, T. Asai, M. Motomura
{"title":"Caching memcached at reconfigurable network interface","authors":"E. Fukuda, Hiroaki Inoue, Takashi Takenaka, Dahoo Kim, Tsunaki Sadahisa, T. Asai, M. Motomura","doi":"10.1109/FPL.2014.6927487","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927487","url":null,"abstract":"Memcached is a technology that improves response speed of web servers by caching data on DRAMs in distributed servers. In order to achieve higher performance, memcached has been evaluated on various platforms. Among them, FPGA seems to be the most efficient platform to run memcached, and several research groups are trying to achieve higher throughput with it. However, it is difficult to utilize a large amount of memory (several dozen gigabytes) with an FPGA. Some groups are trying to solve this problem by using an embedded CPU for memory allocation and another group is employing an SSD. Unlike other approaches that try to replace memcached itself on FPGAs, our approach augments the software memcached running on the host CPU by caching its data and some operations at the FPGA-equipped network interface card (NIC) mounted on the server. The locality of memcached data enables the FPGA NIC to have a fairly high hit rate with a smaller memory. We first explore the cache parameters by software simulations and estimate the effectiveness of our approach, and then prototype a system to prove its effectiveness. Through our evaluation with YCSB, a standard key-value store (KVS) benchmarking tool, we estimate that the latency improved by an order of magnitude over software memcached running on a high performance CPU.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132901424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent insertion of latency-oblivious logic onto FPGAs","authors":"Eddie Hung, T. Todman, W. Luk","doi":"10.1109/FPL.2014.6927497","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927497","url":null,"abstract":"We present an approach for inserting latency-oblivious functionality into pre-existing FPGA circuits transparently. To ensure transparency - that such modifications do not affect the design's maximum clock frequency - we insert any additional logic post place-and-route, using only the spare resources that were not consumed by the pre-existing circuit. The typical challenge with adding new functionality into existing circuits incrementally is that spare FPGA resources to host this functionality must be located close to the input signals that it requires, in order to minimise the impact of routing delays. In congested designs, however, such co-location is often not possible. We overcome this challenge by using flow techniques to pipeline and route signals from where they originate, potentially in a region of high resource congestion, into a region of low congestion capable of hosting new circuitry, at the expense of latency. We demonstrate and evaluate our approach by augmenting realistic designs with self-monitoring circuitry, which is not sensitive to latency. We report results on circuits operating over 200MHz and show that our insertions have no impact on timing, are 2-4 times faster than compile-time insertion, and incur only a small power overhead.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114542637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RAM-based hardware accelerator for network data anonymization","authors":"Fumito Yamaguchi, Kanae Matsui, H. Nishi","doi":"10.1109/FPL.2014.6927400","DOIUrl":"https://doi.org/10.1109/FPL.2014.6927400","url":null,"abstract":"Many network services including intrusion detection and recommendation provide their services by analyzing information acquired from network transactions. A careful analysis of these data can reveal valuable information when deep packet inspection is performed. Since these packet analyses generate sensitive information from enormous volumes of transmitted data, the requirement for data anonymization has been discussed. There have been many studies of anonymization techniques and their implementation in software applications. However, limited research has been undertaken regarding hardware-based anonymizers. This paper proposes and evaluates a RAM-based anonymization architecture that maintains both high throughput and a low information-loss ratio.","PeriodicalId":172795,"journal":{"name":"2014 24th International Conference on Field Programmable Logic and Applications (FPL)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114693456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}