{"title":"Investigating How Hardware Architectures are Expressed in High-Level Languages for an SKA Algorithm","authors":"K. Sherwin, B. Stappers, P. Thiagaraj, K. Wang, O. Sinnen","doi":"10.1109/FPT.2018.00059","DOIUrl":"https://doi.org/10.1109/FPT.2018.00059","url":null,"abstract":"High-level approaches to hardware development can expedite the design process, allowing for rapid design space exploration. However, in order to generate optimised solutions expert intervention is often still required. This work seeks to explore the relationship between high-level descriptions and the resulting hardware architecture. This aims to reduce the barrier to entry for software developers (without hardware expertise) to produce optimised hardware designs through application of classical loop optimisation techniques. An algorithm from the Square Kilometre Array (SKA) is chosen to demonstrate the effects of such changes in a real world, real-time application requiring high throughput and low power consumption, taking a systematic approach in order to achieve an optimised result. A systolic array design is also discussed and compared with the software style changes. The Intel FPGA SDK for OpenCL (AOCL) Offline Compiler (AOC) is used here for verification and synthesis of the designs being examined, targeting an Arria-10 FPGA accelerator.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114138423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MultiMQC: A Multilevel Message Queuing Cache Combining In-NIC and In-Kernel Memories","authors":"Koya Mitsuzuka, Yuta Tokusashi, Hiroki Matsutani","doi":"10.1109/FPT.2018.00029","DOIUrl":"https://doi.org/10.1109/FPT.2018.00029","url":null,"abstract":"Message queuing systems that deliver messages from publishers to subscribers play an important role in collecting data from IoT devices. Traditional message queuing systems have improved their performance in the context of transferring log data from publishers such as Web servers to subscribers that analyze the log data. In this case, both publishers and subscribers have been assumed to have enough buffer capacity and can transfer data as jumbo frame packets for high efficiency. In recent IoT applications, however, publishers are small sensors or edge devices with low-power processors and limited memory capacity. Vast numbers of such publishers produce relatively small packets. So many small messages significantly decrease the efficiency of conventional message queuing systems. To address this issue, a dedicated message queuing logic can be implemented on an FPGA-based network interface card (FPGA NIC). However, a serious issue with such an in-NIC approach is the limited memory capacity on the FPGA NIC. To handle message overflow of the in-NIC cache, in this paper, it is combined with a large in-kernel software cache. More specifically, we propose a multilevel message queuing cache combining in-NIC and in-kernel memories, called MultiMQC. The multilevel cache improves the read performance. Regarding the write performance, MultiMQC introduces a batch transfer that packs small incoming messages into a single batch. We implemented MultiMQC using a NetFPGA-SUME board as the in-NIC cache and the Linux Netfilter framework as the in-kernel cache. The experimental results demonstrate that the write throughput increases in proportion to the batch size. When pull requests hit in the in-NIC cache, the read throughput reaches 95.8% of 10GbE line rate in four 10GbE interfaces.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130358727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Secure Hardware Kernels Execution in CPU+FPGA Heterogeneous Cloud","authors":"Festus Hategekimana, Joel Mandebi Mbongue, Md Jubaer Hossain Pantho, C. Bobda","doi":"10.1109/FPT.2018.00035","DOIUrl":"https://doi.org/10.1109/FPT.2018.00035","url":null,"abstract":"In this paper, we present a new security framework which allows controlled sharing and isolated execution of mutually distrusted FPGA-accelerators in heterogeneous cloud systems. The proposed framework enables the accelerators running in FPGAs in cloud computers to transparently inherit, at run-time, the software security policies of the virtual machine processes calling them. This capability allows the system security policy enforcement mechanism to propagate access control privilege boundaries expressed at the hypervisor level down to individual FPGA-accelerators. Furthermore, we present a software/hardware prototype implementation of the proposed security framework, showing that it can easily be transparently integrated within the virtual machine software stacks that run in today's cloud-based systems. Experimental results show that our proposed framework provides secure hardware execution with negligible execution overhead on guest VM applications.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115575977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Platform on All-Programmable SoC for Micro Autonomous Robots","authors":"Yuya Kudo, A. Takada, S. Tsuda, Takumi Sakai, T. Izumi","doi":"10.1109/FPT.2018.00085","DOIUrl":"https://doi.org/10.1109/FPT.2018.00085","url":null,"abstract":"We present a platform on all-programmable SoC for micro autonomous robots for probing, exploring, rescuing, etc. In contrast to the challenges for self-driving cars, which have a relatively rich power supply and a high-performance computing platform, our challenge is to develop technologies for autonomous robots with tight restrictions on size, weight, and energy consumption. We utilize all-programmable SoCs for this purpose and develop a system including camera interface, image processing, recognition, action planning, and motor control. The key techniques are to optimize the dataflow between software (and main memory) and hardware for efficiency and to adopt a standard stream interface in hardware modules for productivity. The system can be utilized as a common platform for micro autonomous robots. The system is implemented as a robot car named ZybotR2-Z2 and achieves 3 to 37 frames/sec image recognition and car control with a single Zynq-7020 device.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114291013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AConFPGA: A Multiple-Output Boolean Function Approximation DSE Technique Targeting FPGAs","authors":"Jorge Echavarria, S. Wildermann, J. Teich","doi":"10.1109/FPT.2018.00065","DOIUrl":"https://doi.org/10.1109/FPT.2018.00065","url":null,"abstract":"New relaxed quality standards laid down by approximate computing enrich the design pool with architectures dissipating less power, consuming fewer resources or with smaller latencies. In LUT-based FPGA logic approximation, the number of LUTs and the latency associated with a design can be optimized by allowing the approximation of circuit results. In this paper, we present techniques for automatic design space exploration (DSE) of Boolean function falsifications and their ability to reduce resource usage as well as the length of critical paths on LUT-based FPGAs. Our experiments give evidence that resource reductions of about 20% are easily achievable for error rates amounting to less than 0.05% w.r.t. accurate designs.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117297865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Accelerated OpenVX Overlay for Pure Software Programmers","authors":"Hossein Omidian, N. Ivanov, G. Lemieux","doi":"10.1109/FPT.2018.00056","DOIUrl":"https://doi.org/10.1109/FPT.2018.00056","url":null,"abstract":"This paper presents an FPGA-based overlay for accelerating computer vision applications written in OpenVX. A software programmer simply writes an application using the standard OpenVX API. The OpenVX overlay consists of an architecture and a runtime system that runs any OpenVX application, unmodified, in an accelerated manner on an FPGA. The architecture uses a Soft Vector Processor (SVP) for general acceleration, and a library of Vector Custom Instructions (VCIs) to further accelerate specific OpenVX kernels in the FPGA fabric. The VCIs are predesigned by a skilled FPGA designer. The runtime system analyzes the OpenVX computational graph and selects some kernel nodes to be executed by VCIs, with the remaining kernel nodes to be executed by the SVP. In making the selection, the runtime system uses an optimization algorithm and relies upon bitstream relocation and bitstream merging to fit multiple VCIs into a single, fixed-size Partially Reconfigurable Region (PRR). The optimization algorithm must select the VCIs that satisfy the area constraint of the PRR and give the best overall application acceleration. For example, on a Canny-blur OpenVX application, an 8-lane SVP achieves a speedup of 5.3 over the hard ARM Cortex-A9. Selecting some nodes as VCIs provides another 3.5 times speedup, for an overall speedup of 18.5. The overlay enables OpenVX programmers with no FPGA design knowledge to accelerate their applications.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122018901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of an FPGA Controlled \"Mini-Car\" Toward Autonomous Driving","authors":"Musashi Aoto, Y. Wada, Yousuke Numata","doi":"10.1109/FPT.2018.00084","DOIUrl":"https://doi.org/10.1109/FPT.2018.00084","url":null,"abstract":"We are developing an FPGA-controlled \"Mini-Car\" for the FPT'18 design competition toward realizing an autonomous driving car. In the competition, we need to realize fundamental techniques like localization and path planning, while employing road lane detection, traffic signal detection, and other object detection methods. In this paper, we summarize the development plan for our Mini-Car to realize autonomous driving techniques based on the regulations of the competition.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125872132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Study on Introducing FPGA to ROS Based Autonomous Driving System","authors":"Yasuhiro Nitta, Sou Tamura, Hideki Takase","doi":"10.1109/FPT.2018.00090","DOIUrl":"https://doi.org/10.1109/FPT.2018.00090","url":null,"abstract":"We are developing an autonomous driving robot using a programmable SoC. The robot under development does not communicate with an external PC and performs all judgment and control on the board mounted on the robot. We aim to realize a built-in autonomous driving system with low power consumption and high performance by offloading high-load processing to the FPGA. At present, the FPGA is used only for acquiring camera images, but we are planning a hardware implementation of the system currently constructed in software. In addition, we used ROS (Robot Operating System) to construct the robot's autonomous driving system, and the components to be developed are reusable. This document describes the detailed configuration and future prospects of the robot currently under development.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"39 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131138604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QoS-Aware Cross-Layer Reliability-Integrated FPGA-Based Dynamic Partially Reconfigurable System Partitioning","authors":"Siva Satyendra Sahoo, T. D. A. Nguyen, B. Veeravalli, Akash Kumar","doi":"10.1109/FPT.2018.00041","DOIUrl":"https://doi.org/10.1109/FPT.2018.00041","url":null,"abstract":"Dynamic Partial Reconfiguration (DPR) can be used for time-sharing of computing resources within Partially Reconfigurable Regions (PRRs) in FPGA-based systems. The heterogeneous partitioning in such systems allows the user to exploit the application-specific mapping of Partially Reconfigurable Modules (PRMs) to PRRs to implement more efficient designs. It offers increased opportunities in optimizing the reliability of the system across multiple layers - from the low-level physical one to the higher application layer. This method, called cross-layer reliability, can potentially exploit the application-specific tolerances to the quality of service (QoS) to tackle the increasing device fault-rates more cost-effectively by distributing the fault-mitigation to different layers. In this work, we propose a QoS-aware cross-layer reliability-integrated design methodology for FPGA-based DPR systems. Specifically, our methodology analyzes the requirements of the applications in terms of Functional Reliability, System Lifetime and Makespan to determine the best possible combinations of reliability-oriented design choices in different layers. We report up to an average of 24% and 30% performance improvements for single and multi-objective optimization-based system partitioning.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122308583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application Acceleration on FPGAs with OmpSs@FPGA","authors":"Jaume Bosch, Xubin Tan, Antonio Filgueras, Miquel Vidal Piñol, Marc Mateu, Daniel Jiménez-González, C. Álvarez, X. Martorell, E. Ayguadé, Jesús Labarta","doi":"10.1109/FPT.2018.00021","DOIUrl":"https://doi.org/10.1109/FPT.2018.00021","url":null,"abstract":"OmpSs@FPGA is the flavor of OmpSs that allows offloading application functionality to FPGAs. Similarly to OpenMP, it is based on compiler directives. While the OpenMP specification also includes support for heterogeneous execution, we use OmpSs and OmpSs@FPGA as prototype implementation to develop new ideas for OpenMP. OmpSs@FPGA implements the tasking model with runtime support to automatically exploit all SMP and FPGA resources available in the execution platform. In this paper, we present the OmpSs@FPGA ecosystem, based on the Mercurium compiler and the Nanos++ runtime system. We show how the applications are transformed to run on the SMP cores and the FPGA. The application kernels defined as tasks to be accelerated, using the OmpSs directives are: 1) transformed by the compiler into kernels connected with the proper synchronization and communication ports, 2) extracted to intermediate files, 3) compiled through the FPGA vendor HLS tool, and 4) used to configure the FPGA. Our Nanos++ runtime system schedules the application tasks on the platform, being able to use the SMP cores and the FPGA accelerators at the same time. We present the evaluation of the OmpSs@FPGA environment with the Matrix Multiplication, Cholesky and N-Body benchmarks, showing the internal details of the execution, and the performance obtained on a Zynq Ultrascale+ MPSoC (up to 128x). The source code uses OmpSs@FPGA annotations and different Vivado HLS optimization directives are applied for acceleration.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"30 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126090484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}