{"title":"An Overlay for Rapid FPGA Debug of Machine Learning Applications","authors":"D. H. Noronha, Ruizhe Zhao, Zhiqiang Que, Jeffrey B. Goeders, W. Luk, S. Wilton","doi":"10.1109/ICFPT47387.2019.00024","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00024","url":null,"abstract":"FPGAs show promise as machine learning accelerators for both training and inference. Designing these circuits on reconfigurable technology is challenging, especially due to bugs that only manifest on-chip when the circuit is running at speed. In this paper, we propose a flexible debug overlay family that provides software-like debug times for machine learning applications. At compile time, the overlay is added to the design and compiled. At debug time, the overlay can be configured to record statistical information about identified weight and activation matrices; this configuration can be changed between debug iterations, allowing the user to record a different set of matrices or to record different information about the observed matrices. Importantly, no recompilation is required between debug iterations.
Although the flexibility of our overlay incurs some overhead compared to fixed instrumentation, we argue that the ability to change the debugging scenario without requiring a recompilation may be compelling enough to outweigh the disadvantage of higher overhead for many applications.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116284947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 307-fps 351.7-GOPs/W Deep Learning FPGA Accelerator for Real-Time Scene Text Recognition","authors":"Shirui Zhao, F. An, Hao Yu","doi":"10.1109/ICFPT47387.2019.00043","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00043","url":null,"abstract":"FPGA-based deep learning accelerators have become important for high-throughput and low-power inference at the edge. In this paper, we have developed a computing-in-memory (CIM) accelerator using the binary SegNet (BSEG) for real-time scene text recognition (STR) at the edge. The accelerator performs highly efficient pixel-wise character classification under the CIM architecture, with massive bit-level parallelism as well as an optimized pipeline for low latency on the critical path. The BSEG is obtained during training with a small model size of 2.1 MB and a high classification accuracy of over 90% on the ICDAR-03 and ICDAR-13 datasets. The RTL-level FPGA accelerator processes STR with an energy efficiency of 351.7 GOPs/W and a throughput of 307 fps, handling one frame of 128×32 pixels with a latency of 3.875 ms.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131954043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SoC-FPGA-Based Implementation of Iris Recognition Enhanced by QC-LDPC Codes","authors":"Longyu Ma, Chiu-Wing Sham","doi":"10.1109/ICFPT47387.2019.00075","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00075","url":null,"abstract":"Introducing error correction codes into an iris recognition system to handle the intrinsic fuzziness of iris codes, such as variability and noise, is a research area that has not attracted much attention, but the positive effect brought by error correction should not be underestimated. Rather than the theoretical analysis and simulation that have been well understood and deeply explored, in this paper we focus on the implementation of an iris recognition system with an error correction scheme, namely QC-LDPC. The whole system is based on a compact SoC-FPGA platform, the DE10-Nano Cyclone V SoC evaluation board by Intel. Every iris information bit input to this platform is stored after being encoded into QC-LDPC codes, which make the whole system more feasible than normal LDPC codes, and is loaded to improve the acceptance rate when a verification request is invoked with a new series of iris information.
Moreover, the fundamental modules of the system, such as iris processing and LDPC encoding and decoding, are reorganized and assigned to the section (HPS or FPGA) where they are better suited, leading to a single-chip design.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132583755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In Search of Lost Bandwidth: Extensive Reordering of DRAM Accesses on FPGA","authors":"Gabor Csordas, Mikhail Asiatici, P. Ienne","doi":"10.1109/ICFPT47387.2019.00030","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00030","url":null,"abstract":"For efficient acceleration on FPGA, it is essential for external memory to match the throughput of the processing pipelines. However, the usable DRAM bandwidth decreases significantly if the access pattern causes frequent row conflicts. Memory controllers reorder DRAM commands to minimize row conflicts; however, general-purpose controllers must also minimize latency, which limits the depth of the internal queues over which reordering can occur. For latency-insensitive applications with irregular access patterns, nonblocking caches that support thousands of in-flight misses (miss-optimized memory systems) improve bandwidth utilization by reusing the same memory response to serve as many incoming requests as possible. However, they do not improve the irregularity of the access pattern sent to the memory, meaning that row conflicts will still be an issue. Sending out bursts instead of single memory requests makes the access pattern more sequential; however, realistic implementations trade high throughput for some unnecessary data in the bursts, leading to bandwidth wastage that cancels out part of the gains from regularization. In this paper, we present an alternative approach to extend the scope of DRAM row conflict minimization beyond the possibilities of general-purpose DRAM controllers. We use the thousands of future memory requests that spontaneously accumulate inside the miss-optimized memory system to implement an efficient large-scale reordering mechanism. By reordering single requests instead of sending bursts, we regularize the memory access pattern in a way that increases bandwidth utilization without incurring any data wastage.
Our solution outperforms the baseline miss-optimized memory system by up to 81% and has better worst, average, and best performance than DynaBurst across 15 benchmarks and 30 architectures.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128783704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Complete CPU-FPGA Architecture for Protein Identification with Tandem Mass Spectrometry","authors":"Moucheng Yang, Tao Chen, Xuegong Zhou, Liang Zhao, Yun-ping Zhu, Lingli Wang","doi":"10.1109/ICFPT47387.2019.00051","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00051","url":null,"abstract":"Tandem mass spectrometry-based database searching is currently a significant technique for protein identification in proteomics. The ever-growing protein databases pose severe challenges for efficient database search engines. Profiling analysis shows that X!Tandem, one of the most widely used open-source database search engines for protein identification, spends almost 78% of the total time on the scoring process. In this paper, field-programmable gate arrays (FPGAs) are used as hardware accelerators due to their ability to parallelize arithmetic operations and execute loops in parallel. A scalable heterogeneous CPU-FPGA architecture is proposed to speed up the whole X!Tandem process, in which parent ion matching and scoring are implemented on FPGAs.
The hardware implementation of the scoring process running on one Xilinx Kintex UltraScale FPGA board (XCKU115) at 150 MHz achieves a 21-fold speedup over the original X!Tandem software implementation running on a CPU, while the complete CPU-FPGA architecture, which consists of two FPGA boards, achieves more than a 10-fold speedup over the CPU-only implementation for the whole protein identification process.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"10 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129659937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Resource-Efficient Acceleration Algorithm for Transposed Convolution of GANs on FPGA","authors":"Xinkai Di, Haigang Yang, Zhihong Huang, Ning Mao, Yiping Jia, Yong Zheng","doi":"10.1109/ICFPT47387.2019.00011","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00011","url":null,"abstract":"In recent years, Generative Adversarial Networks (GANs) have been widely adopted for computer vision tasks such as the generation/synthesis of massive images and 3D object modeling. Hardware acceleration of Transposed Convolution layers is especially essential, since the Generative Model (Generator), a critical component of GANs, is computationally intensive in nature. In Transposed Convolution, the zero-inserting preprocessing causes sparsity in the feature maps and further results in many invalid operations. Most existing FPGA architectures cannot effectively tackle this issue. To address the challenges of implementing Transposed Convolution on FPGAs, we present an innovative dataflow design approach that applies the Winograd algorithm for fast processing with high efficiency in terms of resource allocation. In addition, we propose an underlying Hardware Accelerator Architecture that features processing units (PUs) embedded in a parallel, pipelined, and buffered processing flow. In this paper, a parallelism-aware memory partition scheme is also exploited for bandwidth-efficient data access. Implementations of several state-of-the-art GANs by our approach achieve an average performance of 639.2 GOPS on a Xilinx ZCU102 FPGA device.
Relative to an optimized conventional accelerator baseline, this work demonstrates an 8.6× (up to 11.7×) improvement in processing performance, compared to below-2.2× improvements by other works in the literature.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125065874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of Autonomous Driving Robot Car Using SoC FPGA","authors":"A. Kojima, Yuya Osawa","doi":"10.1109/ICFPT47387.2019.00088","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00088","url":null,"abstract":"In this paper, we describe the design and implementation of our autonomous driving robot car for the FPT'19 FPGA design contest. The controller of our robot car is implemented on an Avnet Ultra96 board using a Xilinx UltraScale+ MPSoC, which includes an ARM processor and programmable logic. Its object detection uses neural network hardware on the programmable logic, while its lane keeping and navigation are implemented as software running on the processor part.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116786766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Generation of Multi-Precision Multi-Arithmetic CNN Accelerators for FPGAs","authors":"Yiren Zhao, Xitong Gao, Xuan Guo, Junyi Liu, Erwei Wang, R. Mullins, P. Cheung, G. Constantinides, Chengzhong Xu","doi":"10.1109/icfpt47387.2019.00014","DOIUrl":"https://doi.org/10.1109/icfpt47387.2019.00014","url":null,"abstract":"Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework designed to automate the process of generating efficient CNN accelerators. The generated design is pipelined and each convolution layer uses different arithmetics at various precisions. Using Tomato, we showcase state-of-the-art multi-precision multi-arithmetic networks, including MobileNet-V1, running on FPGAs. To our knowledge, this is the first multi-precision multi-arithmetic autogeneration framework for CNNs. In software, Tomato fine-tunes pretrained networks to use a mixture of short powers-of-2 and fixed-point weights with a minimal loss in classification accuracy. The fine-tuned parameters are combined with the templated hardware designs to automatically produce efficient inference circuits in FPGAs. We demonstrate how our approach significantly reduces model sizes and computation complexities, and permits us to pack a complete ImageNet network onto a single FPGA without accessing off-chip memories for the first time. Furthermore, we show how Tomato produces implementations of networks with various sizes running on single or multiple FPGAs. 
To the best of our knowledge, our automatically generated accelerators outperform the closest FPGA-based competitors by at least 2-4× in latency and throughput; the generated accelerator runs ImageNet classification at a rate of more than 3000 frames per second.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127725629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An End-to-End Solution to Autonomous Driving Based on Xilinx FPGA","authors":"Tian Wu, Weiyi Liu, Yongwei Jin","doi":"10.1109/ICFPT47387.2019.00084","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00084","url":null,"abstract":"Autonomous driving is currently a very hot topic, and many people are trying to provide solutions to this problem. We build our own auto-driving car based on the Xilinx Pynq-Z2; it provides an end-to-end solution that takes camera images as input and directly outputs control instructions. The platform also uses the power of a Deep Learning Processing Unit (DPU) to accelerate the inference process, and provides a simulator for training and testing in a virtual environment. If the car meets a situation that cannot be handled by the AI model, it is easy to add traditional computer vision functions to our control system. Our platform can therefore help people who want to try autonomous driving to build their own models and test them efficiently. We hope that our platform is easy to use and extend.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123980487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unexpected Diversity: Quantitative Memory Analysis for Zynq UltraScale+ Systems","authors":"Kristiyan Manev, Anuj Vaishnav, Dirk Koch","doi":"10.1109/ICFPT47387.2019.00029","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00029","url":null,"abstract":"Memory throughput is one of the major bottlenecks for accelerator performance. Now that Zynq UltraScale+ systems are being deployed everywhere from exascale machines to the edge, it is important to understand the characteristics of their memory subsystem and the optimizations available to developers. In this paper, we extensively evaluate memory performance and behaviour for various AXI port combinations, burst sizes, access patterns, and numbers of accelerators per AXI port. Our results on ZCU102 and Ultra96 boards show that 1) the effective throughput of these systems reaches only 75% and 92.5% of the theoretical maximum, respectively, 2) 128- and 192-byte burst sizes are often optimal, 3) AXI ports of the same type may not always exhibit similar behaviour, 4) multiplexing accelerators in the PL can provide better throughput distribution than multiplexing in the PS, and 5) using all AXI ports does not lead to the highest performance.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130066767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}