{"title":"Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs","authors":"E. Nurvitadhi, Dongup Kwon, A. Jafari, Andrew Boutros, Jaewoong Sim, Phil Tomson, H. Sumbul, Gregory K. Chen, Phil V. Knag, Raghavan Kumar, R. Krishnamurthy, Sergey Gribok, B. Pasca, M. Langhammer, Debbie Marr, A. Dasu","doi":"10.1109/FCCM.2019.00035","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00035","url":null,"abstract":"Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on data-intensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for real-time services to avoid the expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix® 10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA, with a more balanced on-chip memory and compute, can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as a system-in-package to enhance on-chip memory capacity and bandwidth, and provide compute throughput matching the required bandwidth. We show that a small 32 mm2 TensorRAM 10 nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). 
A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× better latency than GPU (FP32) and 34× higher energy efficiency. It has 2× aggregate on-chip memory capacity compared to a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and when integrated with an ASIC chiplet, it can offer a more compelling solution.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123755994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
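The ~6% GPU utilization quoted in this record follows from arithmetic intensity: a persistent-RNN layer is dominated by matrix-vector products that perform roughly two operations per weight byte, so any engine streaming weights from off-chip memory is bandwidth-bound. A back-of-envelope roofline sketch (the peak-TOPS and DRAM-bandwidth values are illustrative placeholders; only the 32 TB/s on-chiplet figure is taken from the abstract):

```python
# Roofline-style estimate of achievable throughput for a persistent RNN.
# ops_per_byte ~ 2: each INT8 weight byte feeds one multiply-accumulate.
peak_tops = 125.0      # hypothetical accelerator peak (TOPS) - placeholder
dram_bw = 0.9          # off-chip DRAM bandwidth (TB/s) - placeholder
onchip_bw = 32.0       # on-chiplet bandwidth (TB/s), per the TensorRAM figure
ops_per_byte = 2.0     # arithmetic intensity of a matrix-vector product

# Achievable throughput is capped by min(compute roof, bandwidth roof).
offchip_tops = min(peak_tops, dram_bw * ops_per_byte)   # bandwidth-bound
onchip_tops = min(peak_tops, onchip_bw * ops_per_byte)  # near compute roof
utilization = offchip_tops / peak_tops                  # a few percent at best
```

Under these placeholder numbers, off-chip execution reaches under 2% of peak, while on-chip weights lift the bandwidth roof far above it, which is the motivation for the persistent approach.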
{"title":"π-BA: Bundle Adjustment Acceleration on Embedded FPGAs with Co-observation Optimization","authors":"S. Qin, Qiang Liu, Bo Yu, Shaoshan Liu","doi":"10.1109/FCCM.2019.00024","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00024","url":null,"abstract":"Bundle adjustment (BA) is a fundamental optimization technique used in many crucial applications, including 3D scene reconstruction, robotic localization, camera calibration, autonomous driving, space exploration, and street view map generation. Essentially, BA is a joint non-linear optimization problem, and one which can consume a significant amount of time and power, especially for large problems. Previous approaches to optimizing BA performance rely heavily on parallel processing or distributed computing, which trade higher power consumption for higher performance. In this paper we propose π-BA, the first hardware-software co-designed BA engine on an embedded FPGA-SoC that exploits custom hardware for higher performance and power efficiency. Specifically, based on our key observation that not all points appear on all images in a BA problem, we designed and implemented a Co-Observation Optimization technique to accelerate BA operations with optimized usage of memory and computation resources. 
Experimental results confirm that π-BA outperforms the existing software implementations in terms of performance and power consumption.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114929312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
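The co-observation idea in this record can be illustrated in a few lines: since not every point is seen by every camera, points can be grouped by their observing-camera sets so that shared camera data is reused rather than refetched. The sketch and its names below are hypothetical, not π-BA's actual data layout:

```python
from collections import defaultdict

# Hypothetical observation list: (camera_id, point_id) pairs recording
# which camera sees which 3D point. Not all points are seen by all cameras.
observations = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 2)]

# For each point, collect the set of cameras observing it.
cams_of_point = defaultdict(set)
for cam, pt in observations:
    cams_of_point[pt].add(cam)

# Group points sharing an identical observing-camera set. Points in one
# group touch exactly the same camera blocks of the BA problem, so they
# can be processed together with the camera data kept resident on-chip.
groups = defaultdict(list)
for pt, cams in cams_of_point.items():
    groups[frozenset(cams)].append(pt)
```

Here points 0 and 1 (both seen only by cameras 0 and 1) end up in one group, so the two camera blocks they share are loaded once for the whole group.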
{"title":"MEG: A RISCV-Based System Simulation Infrastructure for Exploring Memory Optimization Using FPGAs and Hybrid Memory Cube","authors":"Jialiang Zhang, Yang Liu, Gaurav Jain, Yue Zha, Jonathan Ta, J. Li","doi":"10.1109/FCCM.2019.00029","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00029","url":null,"abstract":"Emerging 3D memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide increased bandwidth and massive memory-level parallelism. Efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a real computing environment. In this paper, we propose MEG, an open-source, configurable, cycle-exact, and RISC-V based full system simulation infrastructure using FPGA and HMC. MEG has three highly configurable design components: (i) an HMC adaptation module that not only enables communication between the HMC device and the processor cores but can also be extended to fit other memories (e.g., HBM, non-volatile memory) with minimal effort, (ii) a reconfigurable memory controller, along with its OS support, that can be effectively leveraged by system designers to perform software-hardware co-optimization, and (iii) a performance monitor module that effectively improves the observability and debuggability of the system to guide performance optimization. We provide a prototype implementation of MEG on a Xilinx VCU110 board and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. 
We hope that our open-source release of MEG fills a gap in the space of publicly available FPGA-based full system simulation infrastructures specifically targeting the memory system, and inspires further collaborative software/hardware innovation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114584732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FP-AMR: A Reconfigurable Fabric Framework for Adaptive Mesh Refinement Applications","authors":"Tianqi Wang, Tong Geng, Xi Jin, M. Herbordt","doi":"10.1109/FCCM.2019.00040","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00040","url":null,"abstract":"Adaptive mesh refinement (AMR) is one of the most widely used methods in High Performance Computing, accounting for a large fraction of all supercomputing cycles. AMR operates by dynamically and adaptively applying computational resources non-uniformly to emphasize regions of the model as a function of their complexity. Because AMR generally uses dynamic and pointer-based data structures, acceleration is challenging, especially in hardware. As far as we are aware, no previous work has been published on accelerating AMR with FPGAs. In this paper, we introduce a reconfigurable fabric framework called FP-AMR. The work is in two parts. In the first part, FP-AMR offloads the bulk per-timestep computations to the FPGA; analogous systems have previously done this with GPUs. In the second part, we show that the remaining CPU-based tasks, including particle-mesh mapping, mesh refinement, and coarsening, can also be mapped efficiently to the FPGA. 
We have evaluated FP-AMR using the widely used AMReX framework and found that a single FPGA outperforms a Xeon E5-2660 CPU server (8 cores) by 21x-23x, depending on problem size and data distribution.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132653202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compressed Sensing MRI Reconstruction on Intel HARPv2","authors":"Yushan Su, Michael J. Anderson, Jonathan I. Tamir, M. Lustig, Kai Li","doi":"10.1109/FCCM.2019.00041","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00041","url":null,"abstract":"Implementing the Iterative Soft-Thresholding Algorithm (ISTA) of compressed sensing for MRI image reconstruction is a good candidate for hardware acceleration because real-time functional MRI applications require intensive computation. A straightforward mapping of the computation graph of ISTA onto an FPGA, with a datapath wide enough to saturate memory bandwidth, would require substantial resources, such that a modest-size FPGA would not fit the reconstruction pipeline for an entire MRI image. This paper proposes several methods, such as matrix transpose, datapath reuse, parallelism within maps, and data buffering, to design the kernel components of ISTA and overcome this problem. Our implementation with the Intel OpenCL SDK and performance evaluation on Intel HARPv2 show that our methods can map the reconstruction for an entire 256x256 MRI image with 8 or more channels to its FPGA, while achieving good overall performance.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116028815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
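For orientation on the record above: ISTA's core update alternates a gradient step on the data-fidelity term with soft-thresholding. A minimal NumPy sketch on a toy sparse-recovery problem (the random matrix A is an illustrative stand-in, not the paper's MRI sensing operator or FPGA pipeline):

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the l1 norm: shrink each entry toward zero by t.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, n_iter=1000):
    # Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by iterative soft-thresholding.
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)             # gradient of the data-fidelity term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Toy problem: recover a 3-sparse vector from 60 random measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 100))
x_true = np.zeros(100)
x_true[[3, 50, 97]] = [1.0, -2.0, 1.5]
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
```

Every iteration is a matrix-vector product, a transposed matrix-vector product, and an elementwise shrink, which is why the paper's kernel components center on the matrix transpose and datapath reuse.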
{"title":"Processor Assisted Worklist Scheduling for FPGA Accelerated Graph Processing on a Shared-Memory Platform","authors":"Yu Wang, J. Hoe, E. Nurvitadhi","doi":"10.1109/FCCM.2019.00028","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00028","url":null,"abstract":"FPGA-based processing has gained much attention for accelerating graph analytics because of the demand for performance and energy efficiency. However, while priority scheduling has been shown to be an effective optimization for improving the performance of worklist-based graph computations, it is rarely used in accelerator designs due to its implementation complexity and memory-access overhead. In this paper, we present a heterogeneous processing approach to priority scheduling on a shared-memory CPU-FPGA platform. By exploiting the closely coupled integration of the host processor and the FPGA accelerator, our system dynamically offloads the task of scheduling to a software scheduler on the processor for its programmability, high-capacity cache, and low memory latency, while the FPGA graph processing accelerator enjoys the scheduling benefit and delivers higher performance at excellent energy efficiency. To understand the effectiveness of our solution, we compared it with FPGA-only solutions for two scheduling schemes: the well-known Dijkstra scheduling for Single Source Shortest Path and a new scheduling optimization we developed for improving the data locality of Breadth First Search. 
Whereas the FPGA-only solution requires an impractical amount of on-chip storage to implement a priority queue, the proposed processor-assisted scheduling, which moves the task of scheduling to the processor, places only a negligible load on the processor and retains most of the performance benefit of priority scheduling.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"84 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116256788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
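For reference, the software side of the scheduling scheme in this record is essentially a priority-queue-driven worklist. A minimal, illustrative Python sketch of Dijkstra-scheduled SSSP (the dict-of-adjacency-lists encoding is hypothetical, chosen only to keep the example short; it is not the paper's interface):

```python
import heapq

def sssp(graph, source):
    # Worklist-based single-source shortest paths (Dijkstra scheduling):
    # the priority queue orders active vertices by tentative distance,
    # so each relaxation pops the globally most promising work item.
    dist = {source: 0}
    worklist = [(0, source)]
    while worklist:
        d, u = heapq.heappop(worklist)
        if d > dist.get(u, float("inf")):
            continue                      # stale entry; skip it
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(worklist, (nd, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
# sssp(g, "a") → {"a": 0, "b": 1, "c": 3}
```

The heap here is exactly the structure that is costly to realize in on-chip FPGA storage, which motivates leaving it to the processor's cache hierarchy.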
{"title":"Sonar: Writing Testbenches through Python","authors":"Varun Sharma, Naif Tarafdar, P. Chow","doi":"10.1109/FCCM.2019.00052","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00052","url":null,"abstract":"Design verification is an important though time-consuming aspect of hardware design. A good testbench should support performing functional coverage of a design by making it easy to implement tests and determine which tests are being performed. However, for complex designs, creating and maintaining effective testbenches can take increasing amounts of time away from actual design. A further complication is that there may be two development flows: conventional hardware written in a hardware description language (HDL) such as Verilog or VHDL, and high-level synthesis (HLS). In the HLS approach, the hardware is specified in a higher-level language (HLL) and then converted to an HDL through HLS tools. In this flow, testbenches for the design are written in the same HLL and cosimulation is used to verify the generated HDL. Due to tool restrictions, cosimulation may not always work. In Vivado HLS [1], for example, the design must contain control signals that define when to start and stop the module, or the initiation interval for new data must be one cycle. Without cosimulation, the user must write an HDL testbench manually in addition to a testbench in the HLL for preliminary verification. To simplify writing testbenches, we present Sonar: an open-source Python library for writing cross-language testbenches. From a common source script, Sonar can generate testbenches written in SystemVerilog (SV) and C++. These files can then be imported into standard simulation tools such as ModelSim [2] or Vivado HLS and run. 
The use of Python makes it easy to extend Sonar with higher layers of abstraction for testbenches and to integrate it with other software platforms. Sonar is available at https://github.com/UofT-HPRC/sonar.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128393028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Formalizing Loop-Carried Dependencies in Coq for High-Level Synthesis","authors":"Florian Faissole, G. Constantinides, David B. Thomas","doi":"10.1109/FCCM.2019.00056","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00056","url":null,"abstract":"High-level synthesis (HLS) tools such as VivadoHLS interpret C/C++ code supplemented by proprietary optimization directives called pragmas. In order to perform loop pipelining, HLS compilers have to deal with non-trivial loop-carried data dependencies. In VivadoHLS, the dependence pragma can be used to enforce or to eliminate such dependencies, but the behavior of this directive is only informally specified through examples. Most of the time, programmers and the compiler seem to agree on what the directive means, but accidental misuse of this pragma can lead to the silent generation of an erroneous register-transfer level (RTL) design, meaning code that previously worked may break with newer, more aggressively optimised releases of the compiler. We use the Coq proof assistant to formally specify and verify the behavior of the VivadoHLS dependence pragma. We first embed the syntax and semantics of a tiny imperative language, Imp, in Coq and specify a conformance relation between an Imp program and a dependence pragma based on data-flow transformations. 
We then implement semi-automated methods to formally verify such conformance relations for non-nested loop bodies.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123654389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 4.8x Faster FPGA-Based Iterative Closest Point Accelerator for Object Pose Estimation of Picking Robot Applications","authors":"Atsutake Kosuge, Keisuke Yamamoto, Y. Akamine, T. Yamawaki, T. Oshima","doi":"10.1109/FCCM.2019.00072","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00072","url":null,"abstract":"An FPGA-based accelerator for the iterative-closest-point (ICP) algorithm has been proposed, which achieves 4.8-times-faster object-pose estimation by a picking robot compared with the state-of-the-art technique. Experiments of the proposed FPGA-based ICP accelerator using Amazon Picking Contest data sets have confirmed that the object-pose estimation by the ICP takes only 0.6 seconds, and the entire picking process takes 2.0 seconds with power consumption of 6.0 W.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130251212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
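As background for the record above: point-to-point ICP alternates nearest-neighbour matching with a closed-form rigid alignment (the Kabsch/SVD solution). The sketch below is a generic software baseline of the algorithm being accelerated, not the paper's hardware design:

```python
import numpy as np

def best_rigid_transform(P, Q):
    # Kabsch/SVD: least-squares rotation R and translation t such that
    # R @ p + t ≈ q for matched point rows of P and Q (shape (n, 3)).
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                    # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

def icp(P, Q, n_iter=20):
    # Classic point-to-point ICP: alternate brute-force nearest-neighbour
    # matching with the closed-form rigid alignment above.
    for _ in range(n_iter):
        idx = np.argmin(((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1), axis=1)
        R, t = best_rigid_transform(P, Q[idx])
        P = P @ R.T + t
    return P
```

The nearest-neighbour search dominates the runtime (quadratic in the number of points per iteration), which is the natural target for a hardware accelerator.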