{"title":"DP-Pack: Distributed Parallel Packing for FPGAs","authors":"Qiangpu Chen, Minghua Shen, Nong Xiao","doi":"10.1109/FPT.2018.00054","DOIUrl":"https://doi.org/10.1109/FPT.2018.00054","url":null,"abstract":"Packing is one of the most critical stages in the FPGA physical syntheses flow. In this paper, we propose DP-Pack, a distributed parallel packing approach. DP-Pack consists of two primary steps. First, all of the minimal circuit units are assigned into several subsets where the conflicting units are located in the same subset and the non-conflicting units are distributed in different subsets. Then, the non-conflicting subsets are partitioned by round robin such that the number of subsets in each processor core is equal approximately, leading to good load balance in parallel packing. Second, the parallelization between processor cores is implemented by the MPI-based message queue in a distributed platform. Note that DP-Pack has been integrated into the VTR 7.0 tool. Experimental results show that our DP-Pack scales to 8 processor cores to provide about 1.4~3.2× runtime advantages with acceptable quality degradation, comparing to the academic state-of-the-art AAPack.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116873208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of a Robot Car by Single Line Search Method for White Line Detection with FPGA","authors":"Hiromichi Wakatsuki, T. Kido, K. Arai, Yuhei Sugata, K. Ootsu, T. Yokota, Takeshi Ohkawa","doi":"10.1109/FPT.2018.00088","DOIUrl":"https://doi.org/10.1109/FPT.2018.00088","url":null,"abstract":"In level 5 autonomous driving system, image recognition is required as multiplex safety technology. However, real time image recognition is hard for existing microprocessors. Hence, implementation of driving system on FPGA is useful to achieve real time image recognition for autonomous driving. Therefore, this paper describes implementation of autonomous driving robot with image processing using FPGA. Hough transform which is generally used for white line detection, requires high computing cost. We explain our new white line detection method which features low computation cost. As a result of evaluation, image processing performance on software is about 26.1 frame / sec, and on hardware is about 0.5 frame / sec. In addition, hardware implementation using Vivado HLS is described.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123256920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA Realization of OpenPose Based on a Sparse Weight Convolutional Neural Network","authors":"Akira Jinguji, Tomoya Fujii, Shimpei Sato, Hiroki Nakahara","doi":"10.1109/FPT.2018.00061","DOIUrl":"https://doi.org/10.1109/FPT.2018.00061","url":null,"abstract":"The OpenPose is a kind of a deep learning based pose estimator which achieved a top accuracy for multiple person pose estimations. Even if using the OpenPose, it is necessary to used high-performance GPU since it requires massive parameters access with high-bandwidth off-chip GDDR5 memories and a higher operation clock frequency. Thus, the power consumption becomes a critical issue to realization. Also, its computation time is slower than the current video standard frame speed (29.97 FPS). In the paper, we introduce a sparse weight CNN to reduce the amount of memory size for weights, which is Then, we offer the indirect memory access architecture to realize the sparse CNN convolutional operation efficiently. Also, to increase throughput further, we applied the six stages of pipeline architecture with a pipeline buffer memory realization. Our implementation satisfied the timing constraint for real-time applications. Since our architecture computed an image with 42.6 msec, the number of frames per second (FPS) was 23.43. We measured the total board power consumption: It was 55 Watt. Thus, the performance per power efficiency was 0.444 (FPS/W). Compared with the NVidia Titan X Pascal architecture GPU, it was 3.49 times faster, it dissipated 3.54 times lower power, and its performance per power efficiency was 13.05 times better. 
As far as we know, this work is the first FPGA implementation of the OpenPose.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117200830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of a Control Target Recognition for Autonomous Vehicle Using FPGA with Python","authors":"Hiroki Bingo","doi":"10.1109/FPT.2018.00089","DOIUrl":"https://doi.org/10.1109/FPT.2018.00089","url":null,"abstract":"As an easy development of autonomous driving requiring enormous calculation and electric power, a scheme using FPGA is proposed. To reduce programming effort, a board enabling employment of Python is used, together with high-level libraries. The feasibility of algorithms (white line detection, human detection, etc.) on the FPGA board are investigated.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122949226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Speed Computation of CRC Codes for FPGAs","authors":"Jakub Cabal, Lukás Kekely, J. Korenek","doi":"10.1109/FPT.2018.00042","DOIUrl":"https://doi.org/10.1109/FPT.2018.00042","url":null,"abstract":"As the throughput of networks and memory interfaces is on a constant rise, there is a need for ever-faster error-detecting codes. Cyclic redundancy checks (CRC) are a common and widely used to ensure consistency or detect accidental changes of data. We propose a novel FPGA architecture for the computation of the CRC designed for general high-speed data transfers. Its key feature is allowing a processing of multiple independent data packets (transactions) in each clock cycle, what is a necessity for achieving high overall throughput on very wide data buses. Experimental results confirm that the proposed architecture reaches an effective throughput sufficient for utilization in multi-terabit Ethernet networks (over 2 Tbps or over 3000 Mpps) on a single Xilinx UltraScale+ FPGA.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116031905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing NEF Neural Networks on Embedded FPGAs","authors":"Benjamin Morcos, T. Stewart, C. Eliasmith, Nachiket Kapre","doi":"10.1109/FPT.2018.00015","DOIUrl":"https://doi.org/10.1109/FPT.2018.00015","url":null,"abstract":"Low-power, high-speed neural networks are critical for providing deployable embedded AI applications at the edge. We describe an FPGA implementation of Neural Engineering Framework (NEF) networks with online learning that outperforms mobile GPU implementations by an order of magnitude or more. Specifically, we provide an embedded Python-capable PYNQ FPGA implementation supported with a High-Level Synthesis (HLS) workflow that allows sub-millisecond implementation of adaptive neural networks with low-latency, direct I/O access to the physical world. We tune the precision of the different intermediate variables in the code to achieve competitive absolute accuracy against slower and larger floating-point reference designs. The online learning component of the neural network exploits immediate feedback to adjust the network weights to best support a given arithmetic precision. As the space of possible design configurations of such networks is vast and is subject to a target accuracy constraint, we use the Hyperopt hyper-parameter tuning tool instead of manual search to find Pareto optimal designs. Specifically, we are able to generate the optimized designs in under 500 iterations of Vivado HLS before running the complete Vivado place-and-route phase on that subset. For neural network populations of 64-4096 neurons and 1-8 representational dimensions our optimized FPGA implementation generated by Hyperopt has a speedup of 10-484× over a competing cuBLAS implementation on the Jetson TX1 GPU while using 2.4-9.5× less power. 
Our speedups are a result of HLS-specific reformulation (15× improvement), precision adaptation (4× improvement), and low-latency direct I/O access (1000× improvement).","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128148201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Top-k ListNet Training for Ranking Using FPGA","authors":"Qiang Li, Shane T. Fleming, David B. Thomas, P. Cheung","doi":"10.1109/FPT.2018.00044","DOIUrl":"https://doi.org/10.1109/FPT.2018.00044","url":null,"abstract":"Document ranking is used to order query results by relevance, with different document ranking models providing trade-offs between ranking accuracy and training speed. ListNet is a well-known ranking approach which achieves high accuracy, but is infeasible in practice because training time is quadratic in the number of training documents. This paper considers the acceleration of ListNet training using FPGAs, and improves training speed by using hardware-oriented algorithmic optimisations, and by transforming algorithm structures to remove dependencies and expose parallelism. We implemented our approach on a Xilinx ultrascale FPGA board and applied it to the MQ 2008 benchmark dataset for ranking. Compared to existing ranking approaches ours shows an improvement from 0.29 to 0.33 in ranking accuracy on the same dataset using the NDCG@10 metric. Taking into account the communication between software and hardware, we are able to achieve a 3.21x speedup over an Intel Xeon1.6 GHz CPU implementation.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128225740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the General Chair and Program Co-Chairs","authors":"","doi":"10.1109/fpt.2018.00005","DOIUrl":"https://doi.org/10.1109/fpt.2018.00005","url":null,"abstract":"","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"547 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133132418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compact Area and Performance Modelling for CGRA Architecture Evaluation","authors":"Kuang-Ping Niu, J. Anderson","doi":"10.1109/FPT.2018.00028","DOIUrl":"https://doi.org/10.1109/FPT.2018.00028","url":null,"abstract":"We present area and performance models for use in coarse-grained reconfigurable array (CGRAs) architectural exploration. The area and performance models can be computed rapidly and are incorporated into the open-source CGRA-ME architecture evaluation framework. Area is modelled by synthesizing (into standard cells) commonly occurring CGRA primitives in isolation, and then aggregating the component-wise areas. For performance, we incorporate a fully fledged static-timing analysis (STA) framework into CGRA-ME. The delays in the STA timing graph are annotated based on: 1) a library component-wise delays for logic/memory, and 2) a fanout-based delay estimation model for interconnect. Performance and area are modelled for both performance-optimized and area-optimized standard-cell CGRA implementations. Accuracy of the area and performance models is within 7% and 10%, respectively, of a fully laid-out standard-cell CGRA implementation.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131274582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance High-Precision Floating-Point Operations on FPGAs Using OpenCL","authors":"N. Nakasato, H. Daisaka, T. Ishikawa","doi":"10.1109/FPT.2018.00049","DOIUrl":"https://doi.org/10.1109/FPT.2018.00049","url":null,"abstract":"Development of high-level synthesis tools such as OpenCL SDK for FPGAs enables us to design accelerators for scientific applications that can take advantage of flexibility and efficiency of FPGAs. However, the available OpenCL SDKs only support the standard floating-point (FP) formats. In this paper, we present the performance evaluation of high precision FP operations, which are currently not supported in OpenCL, on recent FPGAs. By using a mechanism to call a custom design from an OpenCL kernel, we evaluate the performance of a sample application in high precision FP format binary128. We found that the sustained performance of our design in binary128 on Intel Arria10 and Stratix10 is 19 and 71 Gflops, respectively.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128124130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}