2019 International Conference on Field-Programmable Technology (ICFPT)最新文献_第9页

An Overlay for Rapid FPGA Debug of Machine Learning Applications 一种基于FPGA的机器学习快速调试方法

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00024

D. H. Noronha, Ruizhe Zhao, Zhiqiang Que, Jeffrey B. Goeders, W. Luk, S. Wilton

{"title":"An Overlay for Rapid FPGA Debug of Machine Learning Applications","authors":"D. H. Noronha, Ruizhe Zhao, Zhiqiang Que, Jeffrey B. Goeders, W. Luk, S. Wilton","doi":"10.1109/ICFPT47387.2019.00024","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00024","url":null,"abstract":"FPGAs show promise as machine learning accelerators for both training and inference. Designing these circuits on reconfigurable technology is challenging, especially due to bugs that only manifest on-chip when the circuit is running at speed. In this paper, we propose a flexible debug overlay family that provides software-like debug times for machine learning applications. At compile time, the overlay is added to the design and compiled. At debug time, the overlay can be configured to record statistical information about identified weight and activation matrices; this configuration can be changed between debug iterations allowing the user to record a different set of matrices, or record different information about the observed matrices. Importantly, no recompilation is required between debug iterations. Although the flexibility of our overlay suffers some overhead compared to fixed instrumentation, we argue that the ability to change the debugging scenario without requiring a recompilation may be compelling and outweigh the disadvantage of higher overhead for many applications.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116284947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

A 307-fps 351.7-GOPs/W Deep Learning FPGA Accelerator for Real-Time Scene Text Recognition 用于实时场景文本识别的307fps 351.7 gops /W深度学习FPGA加速器

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00043

Shirui Zhao, F. An, Hao Yu

引用次数: 5

SoC-FPGA-Based Implementation of Iris Recognition Enhanced by QC-LDPC Codes 基于soc - fpga的QC-LDPC码增强虹膜识别实现

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00075

Longyu Ma, Chiu-Wing Sham

{"title":"SoC-FPGA-Based Implementation of Iris Recognition Enhanced by QC-LDPC Codes","authors":"Longyu Ma, Chiu-Wing Sham","doi":"10.1109/ICFPT47387.2019.00075","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00075","url":null,"abstract":"Introducing error correction codes into an iris recognition system to solve the intrinsic fuzziness, like variability and noise in iris codes is a research area that hasn't catch massive attention, but the positive effect brought by error correction should not be underestimated. Rather than theoretical analysis and simulation that have been well-understood and deeply explored, in this paper, we focus on an iris recognition system implementation with an error correction scheme, namely QC-LDPC and the whole system is based on a compact SoCFPGA platform which is a DE-10 nano Cyclone V SOC evaluation board by Intel. Every iris information bit as the input data in this platform will be stored after they are encoded as QC-LDPC codes which make the whole system more feasible than normal LDPC codes and loaded to improve the acceptance rate when a verification request is invoked with a new series of iris information. Moreover, the fundamental modules, such as iris processing, LDPC encoding and decoding in the system are reorganized and distributed to which section (HPS or FPGA) they are more suitable to be employed into, leading to a single chip design.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132583755","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

In Search of Lost Bandwidth: Extensive Reordering of DRAM Accesses on FPGA 寻找丢失的带宽:FPGA上DRAM访问的广泛重排序

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00030

Gabor Csordas, Mikhail Asiatici, P. Ienne

{"title":"In Search of Lost Bandwidth: Extensive Reordering of DRAM Accesses on FPGA","authors":"Gabor Csordas, Mikhail Asiatici, P. Ienne","doi":"10.1109/ICFPT47387.2019.00030","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00030","url":null,"abstract":"For efficient acceleration on FPGA, it is essential for external memory to match the throughput of the processing pipelines. However, the usable DRAM bandwidth decreases significantly if the access pattern causes frequent row conflicts. Memory controllers reorder DRAM commands to minimize row conflicts; however, general-purpose controllers must also minimize latency, which limits the depth of the internal queues over which reordering can occur. For latency-insensitive applications with irregular access pattern, nonblocking caches that support thousands of in-flight misses (miss-optimized memory systems) improve bandwidth utilization by reusing the same memory response to serve as many incoming requests as possible. However, they do not improve the irregularity of the access pattern sent to the memory, meaning that row conflicts will still be an issue. Sending out bursts instead of single memory requests makes the access pattern more sequential; however, realistic implementations trade high throughput for some unnecessary data in the bursts, leading to bandwidth wastage that cancels out part of the gains from regularization. In this paper, we present an alternative approach to extend the scope of DRAM row conflict minimization beyond the possibilities of general purpose DRAM controllers. We use the thousands of future memory requests that spontaneously accumulate inside the miss-optimized memory system to implement an efficient large-scale reordering mechanism. By reordering single requests instead of sending bursts, we regularize the memory access pattern in a way that increases bandwidth utilization without incurring in any data wastage. Our solution outperforms the baseline miss-optimized memory system by up to 81% and has better worst, average, and best performance than DynaBurst across 15 benchmarks and 30 architectures.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128783704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

A Complete CPU-FPGA Architecture for Protein Identification with Tandem Mass Spectrometry 一个完整的CPU-FPGA结构的蛋白质鉴定与串联质谱

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00051

Moucheng Yang, Tao Chen, Xuegong Zhou, Liang Zhao, Yun-ping Zhu, Lingli Wang

{"title":"A Complete CPU-FPGA Architecture for Protein Identification with Tandem Mass Spectrometry","authors":"Moucheng Yang, Tao Chen, Xuegong Zhou, Liang Zhao, Yun-ping Zhu, Lingli Wang","doi":"10.1109/ICFPT47387.2019.00051","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00051","url":null,"abstract":"Tandem mass spectrometry-based database searching has currently been a significant technique for protein identification in proteomics. The ever-growing protein databases induce severe challenges for efficient database searching engines. Profiling analysis shows that X!Tandem, one of the most widely used open-source database search engines for protein identification, spends almost 78% of the total time on the scoring process. In this paper, field programmable gate arrays (FPGAs) are used as hardware accelerators due to their ability to parallelize arithmetic operations and execute loops in parallel. A scalable heterogeneous CPU-FPGA architecture is proposed to speed up the whole process of X!Tandem, in which parent ion matching and scoring are implemented on FPGAs. The hardware implementation of the scoring process running on one Xilinx Kintex UltraScale FPGA board (XCKU115) at 150 MHz can achieve 21-fold speedup over original X!Tandem software implementation running on a CPU, while the complete CPU-FPGA architecture, which consists of two FPGA boards, achieves more than 10-fold speedup over CPU-only implementation as far as the whole process of protein identification is concerned.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"10 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129659937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Exploring Resource-Efficient Acceleration Algorithm for Transposed Convolution of GANs on FPGA 基于FPGA的gan转置卷积资源高效加速算法研究

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00011

Xinkai Di, Haigang Yang, Zhihong Huang, Ning Mao, Yiping Jia, Yong Zheng

{"title":"Exploring Resource-Efficient Acceleration Algorithm for Transposed Convolution of GANs on FPGA","authors":"Xinkai Di, Haigang Yang, Zhihong Huang, Ning Mao, Yiping Jia, Yong Zheng","doi":"10.1109/ICFPT47387.2019.00011","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00011","url":null,"abstract":"In recent years, Generative Adversarial Networks (GANs) have been widely adopted for computer vision tasks such as generation/synthesis of massive images and 3D object modeling. The hardware acceleration of Transposed Convolution layers is especially essential since the Generative Model (Generator) as a critical component in GANs is computationally intensive in nature. In transposed Convolution, the zeros-inserting preprocessing causes sparsity of the feature maps and further results in many invalid operations. Most of the existing FPGA architectures cannot effectively tackle this issue. To address the challenges of implementing Transposed Convolution on FPGAs, we present an innovative dataflow design approach by applying the Winograd algorithm for fast processing with a high efficiency in terms of resource allocations. In addition, we propose an underlying Hardware Accelerator Architecture that features having PUs embedded in Parallel, Pipelined, and Buffered processing flow. In this paper, a parallelism-aware Memory Partition scheme is also exploded for bandwidth efficient data access. Implementations of several state-of-the-art GANs by our approach achieves an average performance of 639.2 GOPS on Xilinx ZCU102 FPGA device. In reference to an optimized conventional accelerator baseline, this work demonstrates an 8.6× (up to 11.7×) improvement in processing performance, compared to below 2.2× improvement by the other works in literature.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125065874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

Design and Implementation of Autonomous Driving Robot Car Using SoC FPGA 基于SoC FPGA的自动驾驶机器人汽车设计与实现

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00088

A. Kojima, Yuya Osawa

引用次数: 2

Automatic Generation of Multi-Precision Multi-Arithmetic CNN Accelerators for FPGAs fpga多精度多算法CNN加速器的自动生成

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-10-21 DOI: 10.1109/icfpt47387.2019.00014

Yiren Zhao, Xitong Gao, Xuan Guo, Junyi Liu, Erwei Wang, R. Mullins, P. Cheung, G. Constantinides, Chengzhong Xu

{"title":"Automatic Generation of Multi-Precision Multi-Arithmetic CNN Accelerators for FPGAs","authors":"Yiren Zhao, Xitong Gao, Xuan Guo, Junyi Liu, Erwei Wang, R. Mullins, P. Cheung, G. Constantinides, Chengzhong Xu","doi":"10.1109/icfpt47387.2019.00014","DOIUrl":"https://doi.org/10.1109/icfpt47387.2019.00014","url":null,"abstract":"Modern deep Convolutional Neural Networks (CNNs) are computationally demanding, yet real applications often require high throughput and low latency. To help tackle these problems, we propose Tomato, a framework designed to automate the process of generating efficient CNN accelerators. The generated design is pipelined and each convolution layer uses different arithmetics at various precisions. Using Tomato, we showcase state-of-the-art multi-precision multi-arithmetic networks, including MobileNet-V1, running on FPGAs. To our knowledge, this is the first multi-precision multi-arithmetic autogeneration framework for CNNs. In software, Tomato fine-tunes pretrained networks to use a mixture of short powers-of-2 and fixed-point weights with a minimal loss in classification accuracy. The fine-tuned parameters are combined with the templated hardware designs to automatically produce efficient inference circuits in FPGAs. We demonstrate how our approach significantly reduces model sizes and computation complexities, and permits us to pack a complete ImageNet network onto a single FPGA without accessing off-chip memories for the first time. Furthermore, we show how Tomato produces implementations of networks with various sizes running on single or multiple FPGAs. To the best of our knowledge, our automatically generated accelerators outperform closest FPGA-based competitors by at least 2-4× for lantency and throughput; the generated accelerator runs ImageNet classification at a rate of more than 3000 frames per second.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127725629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 23

An End-to-End Solution to Autonomous Driving Based on Xilinx FPGA 基于赛灵思FPGA的端到端自动驾驶解决方案

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-10-16 DOI: 10.1109/ICFPT47387.2019.00084

Tian Wu, Weiyi Liu, Yongwei Jin

引用次数: 6

Unexpected Diversity: Quantitative Memory Analysis for Zynq UltraScale+ Systems 意想不到的多样性:Zynq UltraScale+系统的定量内存分析

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-10-07 DOI: 10.1109/ICFPT47387.2019.00029

Kristiyan Manev, Anuj Vaishnav, Dirk Koch

引用次数: 17