2019 International Conference on Field-Programmable Technology (ICFPT): Latest Papers, Page 5

Lightweight Programmable DSP Block Overlay for Streaming Neural Network Acceleration

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00066

Lenos Ioannou, Suhaib A. Fahmy

Citations: 3

Synchronizing On-Chip Software and Hardware Traces for HLS-Accelerated Programs

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00015

M. Ashcraft, Jeffrey B. Goeders

Abstract: Complex designs generated by modern high-level synthesis (HLS) tools allow users to take advantage of heterogeneous systems, splitting the execution of programs between conventional processors and hardware accelerators. While modern HLS tools continue to improve in efficiency and capability, debugging these designs has received relatively little attention. Fortunately, recent academic work has provided the first means to debug these designs using hardware and software traces. Though these traces allow the user to analyze the flow of execution on the software and the hardware individually, they provide no means of synchronization to determine how operations on one device affect the other. We address this challenge by introducing a synchronization technique that keeps track of operations on shared objects. We identify objects shared between hardware and software and their memory operations, and use unique identifiers to synchronize the traces around these operations. We explore the added costs of this technique on execution time and on hardware and software resources, and ways to reduce them through multiple synchronization schemes. This is demonstrated in an open-source prototype targeting the hybrid flow of the open-source HLS tool LegUp.
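The idea of synchronizing two traces around identified operations on shared objects can be illustrated with a small sketch. Everything here is an assumption made for illustration and not the paper's implementation: the `Event` type, the `merge_traces` function, and the premise that each shared-object operation carries an identifier drawn from one increasing counter and appears in exactly one trace.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Event:
    source: str                    # "sw" or "hw" (hypothetical device tags)
    label: str                     # human-readable description of the event
    sync_id: Optional[int] = None  # set only for operations on shared objects


def merge_traces(sw: List[Event], hw: List[Event]) -> List[Event]:
    """Return one ordering of both traces that respects the sync identifiers.

    Assumption: sync_id values increase within each trace and are unique
    across both traces. Events without an identifier are placed right after
    the most recent identified event of their own trace.
    """
    keyed = []
    for trace in (sw, hw):
        last_id = -1  # events before the first sync point sort first
        for ev in trace:
            if ev.sync_id is not None:
                last_id = ev.sync_id
            keyed.append((last_id, ev))
    # list.sort is stable, so the order inside each trace is preserved.
    keyed.sort(key=lambda pair: pair[0])
    return [ev for _, ev in keyed]


sw_trace = [
    Event("sw", "init"),
    Event("sw", "write buf", 0),
    Event("sw", "compute"),
    Event("sw", "read result", 3),
]
hw_trace = [
    Event("hw", "read buf", 1),
    Event("hw", "mac"),
    Event("hw", "write result", 2),
]

for ev in merge_traces(sw_trace, hw_trace):
    print(ev.source, ev.label, ev.sync_id)
```

The result is only one linearization consistent with the identifiers; unidentified events between two sync points could have happened in a different relative order on the real devices.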

Citations: 1

Improving Memory Access Locality for Vectorized Bit-Serial Matrix Multiplication in Reconfigurable Computing

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00081

Lahiru Rasnayake, Magnus Själander

Citations: 1

[Title page i]

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/icfpt47387.2019.00001

Citations: 0

Supporters and sponsors

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/icfpt47387.2019.00008

Citations: 0

Winograd-Based Real-Time Super-Resolution System on FPGA

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00083

Bizhao Shi, Zhucheng Tang, Guojie Luo, M. Jiang

Citations: 10

Optimized Polynomial Multiplier Over Commutative Rings on FPGAs: A Case Study on BIKE

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00035

Jingwei Hu, Wen Wang, R. Cheung, Huaxiong Wang

Citations: 13

A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00070

D. Nguyen, V. Dang, K. Gaj

Abstract: Due to the emerging threat of quantum computing, one of the major challenges facing the cryptographic community is a timely transition from traditional public-key cryptosystems, such as RSA and Elliptic Curve Cryptography, to a new class of algorithms, collectively referred to as Post-Quantum Cryptography (PQC). Several promising candidates for a new PQC standard can have their software and hardware implementations accelerated using the Number Theoretic Transform (NTT). In this paper, we present an improved hardware architecture for NTT, with a hardware-friendly modular reduction, and demonstrate that this architecture can be efficiently implemented in hardware using High-Level Synthesis (HLS). The novel feature of the proposed architecture is an original memory write-back scheme, which assists in preparing coefficients for later NTT stages and saves memory otherwise used for precomputed constants. Our design is most efficient when log2(N) is even. The latency of our proposed architecture is approximately (N log2(N) + 3N)/4 clock cycles. As a proof of concept, we implemented the NTT operation for several parameter sets used in the PQC algorithms NewHope, FALCON, qTESLA, and CRYSTALS-DILITHIUM.
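The latency estimate quoted in the abstract, (N log2(N) + 3N)/4 clock cycles, can be evaluated for concrete transform sizes. The sketch below is only an illustration: the function name `ntt_latency_cycles` and the sample values of N are assumptions, not taken from the paper, and the abstract itself describes the formula as approximate.

```python
import math


def ntt_latency_cycles(n: int) -> float:
    """Approximate latency in clock cycles: (N * log2(N) + 3N) / 4."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    return (n * math.log2(n) + 3 * n) / 4


# Sample sizes chosen for illustration only.
for n in (256, 512, 1024):
    print(n, ntt_latency_cycles(n))
```

For N = 1024, where log2(N) = 10 is even, the estimate is (10240 + 3072)/4 = 3328 cycles.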

Citations: 22

Autonomous Driving Developed with an FPGA Design

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00085

E. Jones, Keegan Pepper, Ai Li, Shiyue Li, Yuteng Zhang, D. Bailey

Citations: 3

Partitioning FPGA-Optimized Systolic Arrays for Fun and Profit

2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00025

Long Chung Chan, G. Malik, Nachiket Kapre

Abstract: We can improve the inference throughput of deep convolutional networks mapped to FPGA-optimized systolic arrays, at the expense of latency, with array partitioning and layer pipelining. Modern convolutional networks have a growing number of layers, such as the 58 separable layer GoogleNetv1, with varying compute, storage, and data movement requirements. At the same time, modern high-end FPGAs, such as the Xilinx UltraScale+ VU37P, can accommodate high-performance, 650 MHz layouts of large 1920x9 systolic arrays. These can stay underutilized if the network layer requirements do not match the array size. We formulate an optimization problem, for improving array utilization and boosting inference throughput, that determines how to partition the systolic array on the FPGA chip and how to slice the network layers across the array partitions in a pipelined fashion. We adopt a two-phase approach where (1) we identify the layer assignment for each partition using an Evolutionary Strategy, and (2) we adopt a greedy-but-optimal approach for resource allocation to select the systolic array dimensions of each partition. When compared to state-of-the-art systolic architectures, we show throughput improvements in the range 1.3-1.5x and latency improvements in the range 0.5-1.8x against Multi-CLP and Xilinx SuperTile.
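The throughput and latency trade-off of layer pipelining mentioned in the abstract can be made concrete with a generic pipeline model. This is a minimal sketch under stated assumptions and not the paper's optimization formulation: the function `pipeline_metrics` and the per-partition cycle counts are hypothetical, and only the 650 MHz clock comes from the abstract.

```python
from typing import Sequence, Tuple


def pipeline_metrics(stage_cycles: Sequence[int], clock_hz: float) -> Tuple[float, float]:
    """Return (throughput in inputs/s, latency in s) for a linear pipeline.

    Assumption: each partition is one pipeline stage, steady-state throughput
    is limited by the slowest stage, and the latency of a single input through
    an otherwise idle pipeline is the sum of all stage times.
    """
    if not stage_cycles:
        raise ValueError("at least one stage is required")
    bottleneck = max(stage_cycles)
    throughput = clock_hz / bottleneck
    latency = sum(stage_cycles) / clock_hz
    return throughput, latency


CLOCK_HZ = 650e6                             # clock rate quoted in the abstract
stage_cycles = [120_000, 200_000, 150_000]   # hypothetical cycles per partition

throughput, latency = pipeline_metrics(stage_cycles, CLOCK_HZ)
print(f"{throughput:.0f} inferences/s, {latency * 1e3:.3f} ms per inference")
```

In this model, balancing the stages raises throughput because only the slowest stage limits it, while latency still accumulates over every stage.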

Citations: 3