{"title":"Evaluation of Partially Constant, Fine-Grained, Dynamic Partial Reconfigurable Functions in FPGAs","authors":"Stefan Brennsteiner, T. Arslan, J. Thompson","doi":"10.1109/ICFPT47387.2019.00064","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00064","url":null,"abstract":"Dynamic Partial Reconfiguration (DPR) is a well-established technique for changing the functionality of a circuit in an FPGA during runtime. However, DPR can also be used to simplify any given function by replacing one or more inputs or parts of an input of a function by multiple versions of that function. During deployment, depending on the current value of the replaced inputs, a new partial configuration is programmed. This concept of decomposing digital circuits is known as Boole's expansion theorem (also known as Shannon's expansion theorem). Its feasibility in a DPR scheme is investigated in this work and required conditions for its application to fine-grained functions are identified. An extension of the Xilinx Vivado design flow is presented to facilitate the efficient generation of large numbers of partial configurations. The proposed DPR scheme is applied to fixed-point multiplication and division circuits in order to evaluate its performance. Resource utilization, power, and critical path latency are evaluated and compared with conventional FPGA implementations of the same circuits. It is found that the proposed DPR scheme allows for the reduction in power and in critical-path delay in certain scenarios.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128093951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Low-Latency Persistent Neural Machine Translation on CPU Server with Multiple FPGAs","authors":"E. Nurvitadhi, Mishali Naik, Andrew Boutros, Prerna Budhkar, A. Jafari, Dongup Kwon, D. Sheffield, Abirami Prabhakaran, Karthik Gururaj, Pranavi Appana","doi":"10.1109/ICFPT47387.2019.00054","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00054","url":null,"abstract":"We present a CPU server with multiple FPGAs that is purely software-programmable by a unified framework to enable flexible implementation of modern real-life complex AI that scales to large model size (100M+ parameters), while delivering real-time inference latency (~ms). Using multiple FPGAs, we scale by keeping a large model persistent in on-chip memories across FPGAs to avoid costly off-chip accesses. We study systems with 1 to 8 FPGAs for different devices: Intel® Arria® 10, Stratix® 10, and a research Stratix 10 with an AI chiplet. We present the first multi-FPGA evaluation of a complex NMT with bi-directional LSTMs, attention, and beam search. Our system scales well. Going from 1 to 8 FPGAs allows hosting a ~8× larger model with only a ~2× latency increase. A batch-1 inference for a 100M-parameter NMT on 8 Stratix 10 FPGAs takes only ~10 ms. This system offers 110× better latency than the only prior NMT work on FPGAs, which uses a high-end FPGA and stores the model off-chip.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131715863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation of a ROS-Based Autonomous Vehicle on an FPGA Board","authors":"Kento Hasegawa, Kazunari Takasaki, M. Nishizawa, Ryota Ishikawa, Kazushi Kawamura, N. Togawa","doi":"10.1109/ICFPT47387.2019.00092","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00092","url":null,"abstract":"Due to the development of high-performance LSIs, autonomous vehicles will be realized in a few years. The FPGA Design Competition is one of the valuable opportunities to demonstrate an autonomous vehicle on miniature roads. In this paper, we develop a ROS-based autonomous vehicle which is implemented on an FPGA board as a mock car. ROS is a common framework designed to implement various types of robots. Utilizing the ROS-based platform, we develop a model car for the demonstration of an autonomous vehicle. The on-board programmable logic is used to off-load the processing of image recognition such as lane detection, traffic signal detection, and obstacle detection. The implementation results demonstrate that we can successfully implement essential components of the vehicle on an FPGA board with the ROS-based system.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134408449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards the Improvement of Training Efficiency and Image Recognition Accuracy for an FPGA Controlled Mini-Car by Offloading Neural Network Training","authors":"Musashi Aoto, Moe Mitsugi, Takumi Momose, Y. Wada","doi":"10.1109/ICFPT47387.2019.00087","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00087","url":null,"abstract":"This paper describes the design of our field-programmable gate array (FPGA)-controlled Mini-Car and the development strategy for the FPT2019 FPGA Design Competition. We have improved our development strategy for the FPGA-controlled Mini-Car by extending our previous design for the HEART2019 FPGA Design Contest. In our new development plan, we employ multiple image sensors to capture both road conditions and traffic lights at the same time. To manage this diverse image information, we utilize multiple simple-function neural networks for more accurate image recognition. Embedded FPGA platforms are not powerful enough to train these neural networks efficiently; therefore, we are also developing a practical framework to offload the neural network training computation to high-performance servers. This framework will allow us to utilize the trained network information on our FPGA-controlled Mini-Car efficiently.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"401 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122860580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing FPGA Compile Time with Separate Compilation for FPGA Building Blocks","authors":"Yuanlong Xiao, Dongjoon Park, Andrew Butt, Hans Giesen, Zhaoyang Han, Rui Ding, Nevo Magnezi, Raphael Rubin, A. DeHon","doi":"10.1109/ICFPT47387.2019.00026","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00026","url":null,"abstract":"Today's FPGA compilation is slow because it compiles and co-optimizes the entire design in one monolithic mapping flow. This achieves high quality results but also means a long edit-compile-debug loop that slows development and limits the scope of design-space exploration. We introduce PRflow that uses partial reconfiguration and an overlay packet-switched network to separate the HLS-to-bitstream compilation problem for individual components of the FPGA design. This separation allows both incremental compilation, where a single component can be recompiled without recompiling the entire design, and parallel compilation, where all the components are compiled in parallel. Both uses reduce the compilation time. Mapping the Rosetta Benchmarks to a Xilinx XCZU9EG, we show compilation times reduce from 42 minutes to 12 minutes (one case from 160 minutes to 18 minutes) when running on top of commercial tools from Xilinx. Using Symbiflow (Project X-Ray/Yosys/VPR), we show preliminary evidence we can further reduce most compile times to under 5 minutes, with some components mapping in less than 2 minutes.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129317287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Resource Consumption and Performance Overhead Optimized Reduction Circuit on FPGAs","authors":"Linhuai Tang, Gang Cai, Tao Yin, Yong Zheng, Jiamin Chen","doi":"10.1109/ICFPT47387.2019.00049","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00049","url":null,"abstract":"Many scientific and engineering applications involve massive vector operations (such as dot product and matrix multiplication) which can be calculated efficiently by using a reduction circuit. However, the low performance and large resource consumption of the reduction circuit limit the ability of the system. In this paper, an optimized reduction circuit with high performance and low resource consumption is proposed, which can handle multiple sets of arbitrary size without pipeline stalling. A new reduction scheduling algorithm is proposed, which consumes fewer cycles and less buffer space compared with other methods. Moreover, in order to achieve a high clock frequency, the reduction circuit implements novel status and buffer management modules. The proposed design using a deeply pipelined double-precision floating-point adder as reduction operator is implemented on FPGAs, which achieves at least a 1.59-times improvement in area-time product compared with the reported methods.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116014056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Survey on FPGAs in Medical Radiology Applications: Challenges, Architectures and Programming Models","authors":"Daniele Passaretti, J. Joseph, Thilo Pionteck","doi":"10.1109/ICFPT47387.2019.00047","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00047","url":null,"abstract":"This paper provides a survey on hardware architectures and programming models for FPGAs in medical radiology applications. This area imposes many challenging constraints on the underlying hardware platform that can be solved best by FPGAs: hard real-time, low-latency inter-device communication, safety, huge data volume, and on-the-fly image processing tasks. We consider these aspects from the application as well as from the FPGA design perspective and provide a definition and classification of the most relevant challenges, trends, and requirements. Therefore, we analyse different application scenarios ranging from basic Computed Tomography (CT) to modern Interventional Radiology systems. The main focus of our work lies on CT appliances. Finally, we discuss trends in architectural solutions and programming models in these complex systems. By means of this systematic literature survey, we derive an architectural model from real-time, patient and application constraints for hardware design and system engineering tasks.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116416079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shrink It or Shed It! Minimize the Use of LSQs in Dataflow Designs","authors":"Lana Josipović, Atri Bhattacharyya, Andrea Guerrieri, P. Ienne","doi":"10.1109/ICFPT47387.2019.00031","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00031","url":null,"abstract":"When applications have unpredictable memory accesses or irregular control flow, dataflow circuits overcome the limitations of statically scheduled high-level synthesis (HLS). If memory dependences cannot be determined at compile time, dataflow circuits rely on load-store queues (LSQs) to resolve the dependences dynamically, as the circuit runs. However, when employed on reconfigurable platforms, these LSQs are resource-expensive, slow, and power-consuming. In this work, we explore techniques for reducing the cost of the memory interface in dataflow designs. Apart from exploiting standard memory analysis techniques, we present a novel approach which relies on the topology of the control and dataflow graphs to infer memory order with the purpose of minimizing the LSQ size and complexity. On benchmarks obtained automatically from C code, we show that our approach results in significant area reductions, as well as increased performance, compared to naive solutions.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125456438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor","authors":"Susumu Mashimo, Koji Inoue, Ryota Shioya, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, A. Fukuda, Toru Koizumi, J. Kadomoto, H. Irie, M. Goshima","doi":"10.1109/ICFPT47387.2019.00016","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00016","url":null,"abstract":"High-performance soft processors in field-programmable gate arrays (FPGAs) have become increasingly important as recent large FPGA systems have relied on soft processors to run many complex workloads, like a network software stack. An out-of-order (OoO) superscalar approach is a good candidate to improve performance in such cases, as evidenced from OoO hard processor studies. Recent studies have revealed, however, that conventional OoO processor components do not fit well in an FPGA, and it is thus important to carefully design such components for FPGA characteristics. Hence, we propose the RSD processor: a new, open-source OoO RISC-V soft processor optimized for an FPGA. The RSD supports many aggressive OoO execution features, like speculative scheduling, OoO memory instruction execution and disambiguation, a memory dependence predictor, and a non-blocking cache. While the RSD supports such aggressive features, it also leverages FPGA characteristics. Therefore, it consumes fewer FPGA resources than are consumed by existing OoO soft processors, which do not support such aggressive features well. We first introduce the end result of the RSD microarchitecture design and then describe several novel optimization techniques. The RSD achieves up to 2.5-times higher Dhrystone MIPS while using 60% fewer registers and 64% fewer lookup tables (LUTs) as compared to state-of-the-art, open-source OoO processors.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127575910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting Deep Learning Parallelism: Fine-Grained Inference Engine Utilizing Online Arithmetic","authors":"Ameer Abdelhadi, Lesley Shannon","doi":"10.1109/ICFPT47387.2019.00073","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00073","url":null,"abstract":"In this paper, we revisit the parallelism of neural inference engines. In a departure from the conventional coarse-grained neuron-level parallelism, we propose a synapse-level parallelism by performing highly parallel fine-grained neural computations. Our method employs online Most Significant Digit First (MSDF) digit-serial arithmetic to enable early termination of the computation. Using online MSDF bit-serial arithmetic for DNN inference (1) enables early termination of ineffectual computations, (2) enables mixed-precision operations, (3) allows higher frequencies without compromising latency, and (4) alleviates the infamous weights memory bottleneck. The proposed technique is efficiently implemented on FPGAs due to their concurrent fine-grained nature, and the availability of on-chip distributed SRAM blocks. Compared to other bit-serial methods, our Fine-Grained Inference Engine (FGIE) improves energy efficiency by ×1.8 while having similar performance gains.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122676253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}