2019 International Conference on Field-Programmable Technology (ICFPT): Latest Papers

Lightweight Programmable DSP Block Overlay for Streaming Neural Network Acceleration
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00066
Lenos Ioannou, Suhaib A. Fahmy
{"title":"Lightweight Programmable DSP Block Overlay for Streaming Neural Network Acceleration","authors":"Lenos Ioannou, Suhaib A. Fahmy","doi":"10.1109/ICFPT47387.2019.00066","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00066","url":null,"abstract":"Implementations of hardware accelerators for neural networks are increasingly popular on FPGAs, due to flexibility, achievable performance and efficiency gains resulting from network optimisations. The long compilation time required by the backend toolflow, however, makes rapid deployment and prototyping of such accelerators on FPGAs more difficult. Moreover, achieving high frequency of operation requires significant low-level design effort. We present a neural network overlay for FPGAs that exploits DSP blocks, operating at near their theoretical maximum frequency, while minimizing resource utilization. The proposed architecture is flexible, enabling rapid runtime configuration of network parameters according to the desired network topology. It is tailored for lightweight edge implementations requiring acceleration, rather than the highest throughput achieved by more complex architectures in the datacenter.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130832515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Synchronizing On-Chip Software and Hardware Traces for HLS-Accelerated Programs
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00015
M. Ashcraft, Jeffrey B. Goeders
{"title":"Synchronizing On-Chip Software and Hardware Traces for HLS-Accelerated Programs","authors":"M. Ashcraft, Jeffrey B. Goeders","doi":"10.1109/ICFPT47387.2019.00015","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00015","url":null,"abstract":"Complex designs generated from modern high-level synthesis tools allow users to take advantage of heterogeneous systems, splitting the execution of programs between conventional processors, and hardware accelerators. While modern HLS tools continue to improve in efficiency and capability, debugging these designs has received relatively minor attention. Fortunately, recent academic work has provided the first means to debug these designs using hardware and software traces. Though these traces allow the user to analyze the flow of execution on both the software and hardware individually, they provide no means of synchronization to determine how operations on one device affect the other. We address this challenge by introducing a synchronization technique that keeps track of operations on shared objects. We identify objects shared between hardware and software and their memory operations, and use unique identifiers to synchronize the traces around these operations. We explore the added costs of this technique on execution time and hardware and software resources, and ways to reduce it through multiple synchronization schemes. This is demonstrated in an open-source prototype targeting the hybrid flow of the open-source HLS-tool LegUp.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131262092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Improving Memory Access Locality for Vectorized Bit-Serial Matrix Multiplication in Reconfigurable Computing
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00081
Lahiru Rasnayake, Magnus Själander
{"title":"Improving Memory Access Locality for Vectorized Bit-Serial Matrix Multiplication in Reconfigurable Computing","authors":"Lahiru Rasnayake, Magnus Själander","doi":"10.1109/ICFPT47387.2019.00081","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00081","url":null,"abstract":"Low-precision matrix multiplication has gained significant interest in the research community due to its applicability in the quantized neural network domain. As a result, a multitude of variable precision hardware designs have been proposed since fixed-precision hardware causes under-utilization of the hardware resources due to the low and varying precision in such applications. Bit-serial hardware takes advantage of the frugal nature of bit-serial computations that can operate on only as many bits as necessary. A bit-serial matrix multiplication consists of a summation of weighted binary matrix multiplications. In this work, we study the inherent locality of bit-serial matrix multiplications and propose a locality-aware scheduling algorithm that eliminates redundant data fetches from memory. The proposed schedule improves with up to 76% compared to a schedule that computes each binary matrix multiplication in sequence.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121803485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
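The abstract above states that a bit-serial matrix multiplication is a summation of weighted binary matrix multiplications. A minimal sketch of that decomposition follows, assuming unsigned integer operands and plain Python lists. The function names (`bit_planes`, `matmul`, `bit_serial_matmul`) are illustrative and not taken from the paper, and the sketch shows only the baseline schedule that computes each binary product in sequence, not the locality-aware schedule the authors propose.

```python
def bit_planes(matrix, bits):
    """Split an unsigned integer matrix into `bits` binary matrices, LSB first."""
    return [[[(x >> k) & 1 for x in row] for row in matrix] for k in range(bits)]


def matmul(a, b):
    """Plain integer matrix product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def bit_serial_matmul(a, b, a_bits, b_bits):
    """Compute a @ b as a sum of binary matrix products weighted by 2**(i + j)."""
    rows, cols = len(a), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    for i, a_i in enumerate(bit_planes(a, a_bits)):
        for j, b_j in enumerate(bit_planes(b, b_bits)):
            partial = matmul(a_i, b_j)  # binary x binary product
            for r in range(rows):
                for c in range(cols):
                    acc[r][c] += partial[r][c] << (i + j)  # apply the weight
    return acc


# Illustrative 2-bit operands (all entries fit in two bits).
A = [[3, 1], [2, 0]]
B = [[1, 2], [3, 1]]
assert bit_serial_matmul(A, B, 2, 2) == matmul(A, B)
```

Because the loop runs once per pair of bit positions, the work scales with the product of the two operand precisions, which is why such hardware can operate on only as many bits as necessary.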
[Title page i]
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/icfpt47387.2019.00001
{"title":"[Title page i]","authors":"","doi":"10.1109/icfpt47387.2019.00001","DOIUrl":"https://doi.org/10.1109/icfpt47387.2019.00001","url":null,"abstract":"","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125314370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Supporters and sponsors
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/icfpt47387.2019.00008
{"title":"Supporters and sponsors","authors":"","doi":"10.1109/icfpt47387.2019.00008","DOIUrl":"https://doi.org/10.1109/icfpt47387.2019.00008","url":null,"abstract":"","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125565888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Winograd-Based Real-Time Super-Resolution System on FPGA
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00083
Bizhao Shi, Zhucheng Tang, Guojie Luo, M. Jiang
{"title":"Winograd-Based Real-Time Super-Resolution System on FPGA","authors":"Bizhao Shi, Zhucheng Tang, Guojie Luo, M. Jiang","doi":"10.1109/ICFPT47387.2019.00083","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00083","url":null,"abstract":"With the rapid development of computer vision theory and visual display devices, High Frame Rate (HFR) and Ultra High Definition (UHD) techniques have received increasing attention from academia and industry. As they put high demands on performance and energy-efficiency, efficient customized hardware is required. In this paper, we propose an FPGA-based super-resolution system that enables real-time UHD upscaling in both high image quality and high frame rates. Our system crops each frame into blocks, measures their total variation values, and dispatches them accordingly to a neural network or an interpolation module for upscaling. We also propose a fast transposed convolution algorithm based on the Winograd algorithm, which reduces the number of multiplications. Experimental results show that the proposed super-resolution system achieves superior performance in both reconstruction performance and efficiency over previous works.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116684399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
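The abstract above credits the Winograd algorithm with reducing the number of multiplications. As a rough illustration of where the saving comes from, the sketch below implements the standard one-dimensional Winograd F(2,3) minimal filtering step, which produces two outputs of a 3-tap filter with four multiplications instead of six. This is the textbook building block only, not the transposed-convolution algorithm proposed in the paper. The function names are illustrative, and integer inputs are assumed so that `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction


def winograd_f23(d, g):
    """Two outputs of a 3-tap 1D filter over 4 inputs, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: depends only on the filter, so it can be precomputed.
    u0 = g0
    u1 = Fraction(g0 + g1 + g2, 2)
    u2 = Fraction(g0 - g1 + g2, 2)
    u3 = g2
    # The four data-dependent multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference result computed with six multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


# Illustrative values: both methods agree.
assert winograd_f23([2, 5, 1, 7], [3, 4, 6]) == direct_f23([2, 5, 1, 7], [3, 4, 6])
```

The saving matters on FPGAs because multipliers map to a limited pool of DSP blocks, while the extra additions can be built from general logic.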
Optimized Polynomial Multiplier Over Commutative Rings on FPGAs: A Case Study on BIKE
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00035
Jingwei Hu, Wen Wang, R. Cheung, Huaxiong Wang
{"title":"Optimized Polynomial Multiplier Over Commutative Rings on FPGAs: A Case Study on BIKE","authors":"Jingwei Hu, Wen Wang, R. Cheung, Huaxiong Wang","doi":"10.1109/ICFPT47387.2019.00035","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00035","url":null,"abstract":"In this paper, we present two constant-time FPGA-based polynomial multipliers for post-quantum secure key encapsulation mechanisms based on quasi-cyclic codes, which are among round 2 candidates in the NIST PQC standardization process. The pipelined hardware architectures for polynomial multiplications proposed in this work are fully parameterized in terms of the size of the polynomial, and can be further tuned flexibly to achieve a trade-off between time and area depending on individual needs. We also present a case study on the BIKE key generators which use these two polynomial multiplier architectures as building blocks. Compared with the state-of-the-art hardware implementation of BIKE, the design proposed in this work is around 9× faster in terms of run-time while maintaining an over 6× smaller time-area product.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116491130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00070
D. Nguyen, V. Dang, K. Gaj
{"title":"A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms","authors":"D. Nguyen, V. Dang, K. Gaj","doi":"10.1109/ICFPT47387.2019.00070","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00070","url":null,"abstract":"Due to an emerging threat of quantum computing, one of the major challenges facing the cryptographic community is a timely transition from traditional public-key cryptosystems, such as RSA and Elliptic Curve Cryptography, to a new class of algorithms, collectively referred to as Post-Quantum Cryptography (PQC). Several promising candidates for a new PQC standard can have their software and hardware implementations accelerated using the Number Theoretic Transform (NTT). In this paper, we present an improved hardware architecture for NTT, with the hardware-friendly modular reduction, and demonstrate that this architecture can be efficiently implemented in hardware using High-Level Synthesis (HLS). The novel feature of the proposed architecture is an original memory write-back scheme, which assists in preparing coefficients for performing later NTT stages, saving memory storage used for precomputed constants. Our design is the most efficient for the case when log2N is even. The latency of our proposed architecture is approximately equal to (N log2(N) + 3N)/4 clock cycles. As a proof of concept, we implemented the NTT operation for several parameter sets used in the PQC algorithms NewHope, FALCON, qTESLA, and CRYSTALS-DILITHIUM.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126542008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
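The abstract above gives the approximate latency of the proposed NTT architecture as (N log2(N) + 3N)/4 clock cycles. The short sketch below simply evaluates that expression for a few power-of-two transform sizes. The function name and the chosen sizes are illustrative assumptions, and the results are only as accurate as the approximation quoted in the abstract.

```python
def ntt_latency_cycles(n):
    """Evaluate (N*log2(N) + 3N) / 4 for a power-of-two transform size N."""
    log2_n = n.bit_length() - 1
    if n < 2 or (1 << log2_n) != n:
        raise ValueError("n must be a power of two, at least 2")
    return (n * log2_n + 3 * n) // 4


# Illustrative sizes only; the abstract notes the design is most
# efficient when log2(N) is even (here N = 256 and N = 1024).
for n in (256, 512, 1024):
    print(n, ntt_latency_cycles(n))
```

For N = 1024 the expression gives (1024 × 10 + 3 × 1024)/4 = 3328 cycles.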
Autonomous Driving Developed with an FPGA Design
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00085
E. Jones, Keegan Pepper, Ai Li, Shiyue Li, Yuteng Zhang, D. Bailey
{"title":"Autonomous Driving Developed with an FPGA Design","authors":"E. Jones, Keegan Pepper, Ai Li, Shiyue Li, Yuteng Zhang, D. Bailey","doi":"10.1109/ICFPT47387.2019.00085","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00085","url":null,"abstract":"For this project, the task is developing algorithms to program an FPGA-controlled vehicle for the FPT'19 design competition. This competition is to encourage the development of level 5 self-driving cars. To achieve level 5 self-driving cars, the use of image processing for object detection, lane detection, traffic light detection and pedestrian detection will be required. With this detection, the vehicle can be guided safely around the track provided. This paper summarises the algorithms developed to achieve autonomous driving techniques, following the regulations of the competition.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126692331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Partitioning FPGA-Optimized Systolic Arrays for Fun and Profit
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00025
Long Chung Chan, G. Malik, Nachiket Kapre
{"title":"Partitioning FPGA-Optimized Systolic Arrays for Fun and Profit","authors":"Long Chung Chan, G. Malik, Nachiket Kapre","doi":"10.1109/ICFPT47387.2019.00025","DOIUrl":"https://doi.org/10.1109/ICFPT47387.2019.00025","url":null,"abstract":"We can improve the inference throughput of deep convolutional networks mapped to FPGA-optimized systolic arrays, at the expense of latency, with array partitioning and layer pipelining. Modern convolutional networks have a growing number of layers, such as the 58 separable layer GoogleNetv1, with varying compute, storage, and data movement requirements. At the same time, modern high-end FPGAs, such as the Xilinx UltraScale+ VU37P, can accommodate high-performance, 650 MHz, layouts of large 1920x9 systolic arrays. These can stay underutilized if the network layer requirements do not match the array size. We formulate an optimization problem, for improving array utilization, and boosting inference throughput, that determines how to partition the systolic array on the FPGA chip, and how to slice the network layers across the array partitions in a pipelined fashion. We adopt a two phase approach where (1) we identify layer assignment for each partition using an Evolutionary Strategy, and (2) we adopt a greedy-but-optimal approach for resource allocation to select the systolic array dimensions of each partition. When compared to state-of-the-art systolic architectures, we show throughput improvements in the range 1.3-1.5x and latency improvements in the range 0.5-1.8x against Multi-CLP and Xilinx SuperTile.","PeriodicalId":241340,"journal":{"name":"2019 International Conference on Field-Programmable Technology (ICFPT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116034860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3