2019 International Conference on Field-Programmable Technology (ICFPT): Latest Publications

Evaluation of Partially Constant, Fine-Grained, Dynamic Partial Reconfigurable Functions in FPGAs

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00064

Stefan Brennsteiner, T. Arslan, J. Thompson

Abstract: Dynamic Partial Reconfiguration (DPR) is a well-established technique for changing the functionality of a circuit in an FPGA during runtime. However, DPR can also be used to simplify any given function by replacing one or more inputs or parts of an input of a function by multiple versions of that function. During deployment, depending on the current value of the replaced inputs, a new partial configuration is programmed. This concept of decomposing digital circuits is known as Boole's expansion theorem (also known as Shannon's expansion theorem). Its feasibility in a DPR scheme is investigated in this work and required conditions for its application to fine-grained functions are identified. An extension of the Xilinx Vivado design flow is presented to facilitate the efficient generation of large numbers of partial configurations. The proposed DPR scheme is applied to fixed-point multiplication and division circuits in order to evaluate its performance. Resource utilization, power, and critical path latency are evaluated and compared with conventional FPGA implementations of the same circuits. It is found that the proposed DPR scheme allows for the reduction in power and in critical-path delay in certain scenarios.

Citations: 2
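The abstract's core idea, replacing an input with a constant and keeping one specialized circuit per possible value of that input, is the multi-valued form of Boole's (Shannon's) expansion; for a single Boolean input x it reads f(x, y) = x · f(1, y) + (not x) · f(0, y). The Python sketch below illustrates the idea in software only. The 4-bit by 2-bit multiplier, the names `full_multiplier`, `specialize`, and `dpr_multiply`, and the dictionary standing in for a library of partial bitstreams are all hypothetical and are not taken from the paper's Vivado flow.

```python
from itertools import product
from typing import Callable, Dict

# Hypothetical example: an unsigned multiplier f(a, b) = a * b in which the
# 2-bit operand b is the "replaced" input that becomes a per-configuration constant.
A_BITS = 4
B_BITS = 2


def full_multiplier(a: int, b: int) -> int:
    """Conventional circuit: both operands are live inputs."""
    return a * b


def specialize(f: Callable[[int, int], int], b_const: int) -> Callable[[int], int]:
    """Cofactor of f with respect to b = b_const (one 'partial configuration')."""
    return lambda a: f(a, b_const)


# One specialized function per value of the replaced input; in a DPR flow each
# would correspond to a separately generated partial bitstream.
configs: Dict[int, Callable[[int], int]] = {
    b: specialize(full_multiplier, b) for b in range(2 ** B_BITS)
}


def dpr_multiply(a: int, b: int) -> int:
    """Select ('reconfigure to') the cofactor for the current b, then evaluate it."""
    return configs[b](a)


# Exhaustively check that the expansion reproduces the original function.
assert all(
    dpr_multiply(a, b) == full_multiplier(a, b)
    for a, b in product(range(2 ** A_BITS), range(2 ** B_BITS))
)
print(f"{len(configs)} partial configurations reproduce all "
      f"{2 ** (A_BITS + B_BITS)} input combinations")
# prints: 4 partial configurations reproduce all 64 input combinations
```

In hardware, each entry of `configs` would be a constant-coefficient circuit (multiplying by 2, for instance, reduces to a shift), which is one plausible source of the power and critical-path savings the abstract reports; the price is a reconfiguration whenever b changes.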

Scalable Low-Latency Persistent Neural Machine Translation on CPU Server with Multiple FPGAs

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00054

E. Nurvitadhi, Mishali Naik, Andrew Boutros, Prerna Budhkar, A. Jafari, Dongup Kwon, D. Sheffield, Abirami Prabhakaran, Karthik Gururaj, Pranavi Appana

Abstract: We present a CPU server with multiple FPGAs that is purely software-programmable by a unified framework to enable flexible implementation of modern real-life complex AI that scales to large model size (100M+ parameters), while delivering real-time inference latency (~ms). Using multiple FPGAs, we scale by keeping a large model persistent in on-chip memories across FPGAs to avoid costly off-chip accesses. We study systems with 1 to 8 FPGAs for different devices: Intel® Arria® 10, Stratix® 10, and a research Stratix 10 with an AI chiplet. We present the first multi-FPGA evaluation of a complex NMT with bi-directional LSTMs, attention, and beam search. Our system scales well. Going from 1 to 8 FPGAs allows hosting ~8× larger model with only ~2× latency increase. A batch-1 inference for a 100M-parameter NMT on 8 Stratix 10 FPGAs takes only ~10 ms. This system offers 110× better latency than the only prior NMT work on FPGAs, which uses a high-end FPGA and stores the model off-chip.

Citations: 8
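The abstract states that the system scales by keeping the whole model resident in on-chip memory spread across several FPGAs. One simple way to picture such a placement is a greedy, layer-contiguous partition under a per-device memory budget, sketched below in Python. This is an assumed scheme for illustration, not the authors' mapping: the function name `partition_layers`, the layer sizes, and the per-FPGA budget (all in millions of parameters) are hypothetical.

```python
from typing import List


def partition_layers(layer_params: List[float], capacity: float,
                     num_devices: int) -> List[List[int]]:
    """Greedily assign consecutive layers to devices so that each device's
    weights fit in its on-chip budget. Raises ValueError if the model cannot
    stay fully on-chip across `num_devices` devices."""
    if num_devices < 1:
        raise ValueError("need at least one device")
    assignment: List[List[int]] = [[] for _ in range(num_devices)]
    device, used = 0, 0.0
    for idx, size in enumerate(layer_params):
        if size > capacity:
            raise ValueError(f"layer {idx} alone exceeds one device's budget")
        if used + size > capacity:
            # Current device is full: continue on the next FPGA.
            device, used = device + 1, 0.0
            if device == num_devices:
                raise ValueError("model does not fit in the combined on-chip memory")
        assignment[device].append(idx)
        used += size
    return assignment


# Hypothetical model: eight layers totalling 100M parameters (sizes in millions),
# with an assumed budget of 25M parameters per FPGA.
layers_m = [20, 20, 10, 10, 10, 10, 10, 10]
print(partition_layers(layers_m, capacity=25, num_devices=8))
# prints: [[0], [1], [2, 3], [4, 5], [6, 7], [], [], []]
```

Keeping each device's layers contiguous means activations only cross an FPGA boundary between partitions. A real mapping would also have to account for numeric precision, buffer memory, and how LSTM, attention, and beam-search state are split across devices, none of which this sketch models.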

Implementation of a ROS-Based Autonomous Vehicle on an FPGA Board

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00092

Kento Hasegawa, Kazunari Takasaki, M. Nishizawa, Ryota Ishikawa, Kazushi Kawamura, N. Togawa

Citations: 10

Towards the Improvement of Training Efficiency and Image Recognition Accuracy for an FPGA Controlled Mini-Car by Offloading Neural Network Training

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00087

Musashi Aoto, Moe Mitsugi, Takumi Momose, Y. Wada

Citations: 2

Reducing FPGA Compile Time with Separate Compilation for FPGA Building Blocks

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00026

Yuanlong Xiao, Dongjoon Park, Andrew Butt, Hans Giesen, Zhaoyang Han, Rui Ding, Nevo Magnezi, Raphael Rubin, A. DeHon

Abstract: Today's FPGA compilation is slow because it compiles and co-optimizes the entire design in one monolithic mapping flow. This achieves high quality results but also means a long edit-compile-debug loop that slows development and limits the scope of design-space exploration. We introduce PRflow that uses partial reconfiguration and an overlay packet-switched network to separate the HLS-to-bitstream compilation problem for individual components of the FPGA design. This separation allows both incremental compilation, where a single component can be recompiled without recompiling the entire design, and parallel compilation, where all the components are compiled in parallel. Both uses reduce the compilation time. Mapping the Rosetta Benchmarks to a Xilinx XCZU9EG, we show compilation times reduce from 42 minutes to 12 minutes (one case from 160 minutes to 18 minutes) when running on top of commercial tools from Xilinx. Using Symbiflow (Project X-Ray/Yosys/VPR), we show preliminary evidence we can further reduce most compile times under 5 minutes, with some components mapping in less than 2 minutes.

Citations: 18
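Once components compile independently, the two benefits PRflow describes can be reasoned about with a simple wall-clock model: a full build is limited by how the per-component jobs are packed onto the available workers, and an incremental build pays only for the components that changed. The Python sketch below assumes that model (no shared link or routing step, and compile jobs that do not slow each other down) and packs jobs with the longest-processing-time-first greedy heuristic. The function names, component names, and minute values are hypothetical and are not measurements from the paper.

```python
import heapq
from typing import Dict, Iterable


def sequential_build_minutes(times: Dict[str, float]) -> float:
    """Separately compiled components built one after another."""
    return sum(times.values())


def parallel_build_minutes(times: Dict[str, float], workers: int) -> float:
    """Wall time when components are spread over `workers` machines using the
    longest-processing-time-first (LPT) greedy heuristic."""
    if workers < 1:
        raise ValueError("need at least one worker")
    loads = [0.0] * workers  # a list of equal values is already a valid min-heap
    for t in sorted(times.values(), reverse=True):
        # Give the next-longest job to the currently least-loaded worker.
        heapq.heappush(loads, heapq.heappop(loads) + t)
    return max(loads)


def incremental_build_minutes(times: Dict[str, float], edited: Iterable[str],
                              workers: int) -> float:
    """Only edited components are recompiled; untouched ones are reused."""
    changed = {c: times[c] for c in edited}
    return parallel_build_minutes(changed, workers) if changed else 0.0


# Hypothetical per-component compile times in minutes (not from the paper).
component_minutes = {"load": 6.0, "compute": 11.0, "pool": 4.5, "store": 8.0}

print(sequential_build_minutes(component_minutes))                # 29.5
print(parallel_build_minutes(component_minutes, workers=4))       # 11.0
print(parallel_build_minutes(component_minutes, workers=2))       # 15.5
print(incremental_build_minutes(component_minutes, ["pool"], 4))  # 4.5
```

With four workers every component gets its own worker, so the full build takes as long as the slowest component; with two workers the greedy packing gives 15.5 minutes, versus 29.5 minutes for compiling the same components one after another. The model deliberately ignores any monolithic co-optimization, which is exactly what separate compilation gives up.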

A Resource Consumption and Performance Overhead Optimized Reduction Circuit on FPGAs

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00049

Linhuai Tang, Gang Cai, Tao Yin, Yong Zheng, Jiamin Chen

Citations: 4

Survey on FPGAs in Medical Radiology Applications: Challenges, Architectures and Programming Models

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00047

Daniele Passaretti, J. Joseph, Thilo Pionteck

Citations: 5

Shrink It or Shed It! Minimize the Use of LSQs in Dataflow Designs

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00031

Lana Josipović, Atri Bhattacharyya, Andrea Guerrieri, P. Ienne

Citations: 11

An Open Source FPGA-Optimized Out-of-Order RISC-V Soft Processor

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00016

Susumu Mashimo, Koji Inoue, Ryota Shioya, Akifumi Fujita, Reoma Matsuo, Seiya Akaki, A. Fukuda, Toru Koizumi, J. Kadomoto, H. Irie, M. Goshima

Abstract: High-performance soft processors in field-programmable gate arrays (FPGAs) have become increasingly important as recent large FPGA systems have relied on soft processors to run many complex workloads, like a network software stack. An out-of-order (OoO) superscalar approach is a good candidate to improve performance in such cases, as evidenced from OoO hard processor studies. Recent studies have revealed, however, that conventional OoO processor components do not fit well in an FPGA, and it is thus important to carefully design such components for FPGA characteristics. Hence, we propose the RSD processor: a new, open-source OoO RISC-V soft processor optimized for an FPGA. The RSD supports many aggressive OoO execution features, like speculative scheduling, OoO memory instruction execution and disambiguation, a memory dependence predictor, and a non-blocking cache. While the RSD supports such aggressive features, it also leverages FPGA characteristics. Therefore, it consumes fewer FPGA resources than are consumed by existing OoO soft processors, which do not support such aggressive features well. We first introduce the end result of the RSD microarchitecture design and then describe several novel optimization techniques. The RSD achieves up to 2.5-times higher Dhrystone MIPS while using 60% fewer registers and 64% fewer lookup tables (LUTs) as compared to state-of-the-art, open-source OoO processors.

Citations: 23
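Among the RSD's listed features is a memory dependence predictor, which decides whether a load may issue ahead of older stores whose addresses are still unknown. The abstract does not describe how the RSD's predictor works, so the Python sketch below shows a generic, deliberately simple scheme instead: a PC-indexed table of 1-bit "wait" flags, similar in spirit to the load-wait table of the Alpha 21264. The class name `LoadWaitPredictor`, the table size, and the indexing (which assumes 4-byte instructions, i.e. no RISC-V compressed extension) are assumptions for illustration.

```python
class LoadWaitPredictor:
    """PC-indexed table of 1-bit flags. A set flag predicts that the load
    depends on an older in-flight store, so it should wait for store
    addresses to resolve instead of issuing speculatively."""

    def __init__(self, entries: int = 64) -> None:
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a positive power of two")
        self.mask = entries - 1
        self.table = [False] * entries

    def _index(self, pc: int) -> int:
        # Drop the 2 low bits (4-byte instructions assumed), keep log2(entries) bits.
        return (pc >> 2) & self.mask

    def should_wait(self, load_pc: int) -> bool:
        return self.table[self._index(load_pc)]

    def report_violation(self, load_pc: int) -> None:
        """Called when a load that issued early is found to have read stale data
        because an older store to the same address executed after it."""
        self.table[self._index(load_pc)] = True

    def reset(self) -> None:
        """Periodically clear all flags so outdated predictions do not persist."""
        self.table = [False] * len(self.table)


# Hypothetical usage: the same load instruction misspeculates once.
pred = LoadWaitPredictor(entries=64)
load_pc = 0x80000010
print(pred.should_wait(load_pc))  # False: first encounter, issue speculatively
pred.report_violation(load_pc)    # ordering violation detected, pipeline flushed
print(pred.should_wait(load_pc))  # True: later instances wait for older stores
```

One bit per entry keeps the table small, which suits FPGA memory resources, but unrelated loads whose PCs map to the same index share a flag (here 0x80000010 and 0x80000110 alias). More precise schemes, such as store-set predictors, track individual load-store pairs at higher hardware cost.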

Revisiting Deep Learning Parallelism: Fine-Grained Inference Engine Utilizing Online Arithmetic

2019 International Conference on Field-Programmable Technology (ICFPT), Pub Date: 2019-12-01, DOI: 10.1109/ICFPT47387.2019.00073

Ameer Abdelhadi, Lesley Shannon

Citations: 5