2019 International Conference on Field-Programmable Technology (ICFPT): Latest Papers

Static Block Floating-Point Quantization for Convolutional Neural Networks on FPGA
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00012
Hongxiang Fan, Gang Wang, Martin Ferianc, Xinyu Niu, W. Luk
Abstract: Convolutional neural networks (CNNs) have been widely applied in various computer vision and speech processing applications. However, the algorithmic complexity of CNNs hinders their deployment in embedded systems with limited memory and computational resources. This paper proposes static block floating-point (BFP) quantization, an effective approach involving Kullback-Leibler divergence, to determine the static shared exponents. Without the need for retraining, the proposed approach is able to quantize CNNs to 8 bits with negligible accuracy loss. An FPGA-based hardware design with static BFP quantization is also proposed. Compared with 8-bit integer linear quantization, our experiments show that the hardware kernel based on static BFP quantization can achieve over 50% reduction in logic resources on an FPGA. Based on static BFP quantization, a tool implemented in the PyTorch framework is developed, which can automatically generate an optimised configuration according to user requirements for given CNN models, where the entire optimization process takes only a few minutes on an Intel Xeon Silver 4110 CPU.
Citations: 7
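The block floating-point idea named in the abstract above can be illustrated with a minimal sketch: a block of values shares one exponent, and each value keeps only a short signed integer mantissa. This is not the paper's method. The abstract says the shared exponents are chosen statically with the help of Kullback-Leibler divergence; the sketch below instead assumes the simplest possible rule (pick the exponent from the largest magnitude in the block), and the function names bfp_quantize and bfp_dequantize are hypothetical.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize floats to signed integer mantissas that share one exponent.

    Assumption: the shared exponent is derived from the largest magnitude
    in the block, not from a KL-divergence search as in the paper.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1.
    _, e = math.frexp(max_abs)
    # One mantissa step is worth 2**shared_exp.
    shared_exp = e - (mantissa_bits - 1)
    scale = math.ldexp(1.0, shared_exp)
    lo = -(1 << (mantissa_bits - 1))
    hi = (1 << (mantissa_bits - 1)) - 1
    # Round to the nearest step and clip to the signed mantissa range.
    mantissas = [max(lo, min(hi, round(x / scale))) for x in block]
    return mantissas, shared_exp


def bfp_dequantize(mantissas, shared_exp):
    """Reconstruct approximate floats from mantissas and the shared exponent."""
    return [math.ldexp(float(m), shared_exp) for m in mantissas]


if __name__ == "__main__":
    mantissas, exp = bfp_quantize([1.0, -0.5, 0.25, 0.0])
    print(mantissas, exp)
    print(bfp_dequantize(mantissas, exp))
```

Because every value in the block uses the same exponent, multiply-accumulate hardware only needs integer arithmetic on the mantissas plus one exponent addition per block, which is plausibly where the logic savings reported in the abstract come from.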
ZytleBot: FPGA Integrated Development Platform for ROS Based Autonomous Mobile Robot
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00089
Yasuhiro Nitta, Sou Tamura, Hidetoshi Yugen, Hideki Takase
Abstract: The FPT2019 FPGA Design Competition is a competition aimed at recommending innovations in utilizing FPGAs to realize level 5 autonomous driving vehicles. We have developed "ZytleBot", an ROS based robot utilizing an FPGA, for the contest. ZytleBot performs all processing necessary for autonomous driving in real time within a programmable SoC. Therefore, it is possible to go around a course imitating an actual road, detect signals and obstacles, and take appropriate action without any external operation. Robot development requires a wide range of knowledge and technology, but we carried out robot development efficiently by using ROS, a robot development middleware, and TurtleBot3, a robot development platform. As an application of the FPGA, road surface image preprocessing and a traffic signal detector using machine learning are implemented in the FPGA. The traffic signal detector uses HOG features and SVM classifiers, and runs over 270 times faster than on the processor. We also provide ZytleBot as a platform for efficient development of FPGA integrated ROS robots.
Citations: 0
Secure Internal Communication of a Trustzone-Enabled Heterogeneous Soc Lightweight Encryption
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00037
E. M. Benhani, C. M. López, L. Bossuet
Abstract: Security in TrustZone-enabled heterogeneous systems-on-chip (SoCs) has been gaining increasing attention for several years, mainly because this type of SoC can be found in more and more applications in servers or in the cloud. The inside-SoC communication layer is one of the main elements of a heterogeneous SoC; indeed, all the data goes through it. Monitoring and controlling inside-SoC communications makes it possible to fend off attacks before system corruption. In this article, we study the feasibility of encrypted data exchange between the secure software executed in a trusted execution environment (TEE) and the secure logic part of a heterogeneous SoC. Experiments are done with a Xilinx Zynq-7010 SoC and two lightweight stream ciphers. We show that using lightweight stream ciphers is an efficient solution without excessive overheads.
Citations: 3
OpenCL Implementation of Cannon’s Matrix Multiplication Algorithm on Intel Stratix 10 FPGAs
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00020
Paolo Gorlani, Tobias Kenter, Christian Plessl
Abstract: Stratix 10 FPGA cards have a good potential for the acceleration of HPC workloads since the Stratix 10 product line introduces devices with a large number of DSP and memory blocks. The high level synthesis of OpenCL codes can play a fundamental role for FPGAs in HPC, because it allows different designs to be implemented with lower development effort compared to hand-optimized HDL. However, Stratix 10 cards are still hard to fully exploit using the Intel FPGA SDK for OpenCL. The implementation of designs with thousands of concurrent arithmetic operations often suffers from place and route problems that limit the maximum frequency or entirely prevent a successful synthesis. In order to overcome these issues for the implementation of the matrix multiplication, we formulate Cannon's matrix multiplication algorithm with regard to its efficient synthesis within the FPGA logic. We obtain a two-level block algorithm, where the lower level sub-matrices are multiplied using our Cannon's algorithm implementation. Following this design approach with multiple compute units, we are able to get maximum frequencies close to and above 300 MHz with high utilization of DSP and memory blocks. This allows for performance results above 1 TeraFLOPS.
Citations: 12
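For readers unfamiliar with Cannon's algorithm, which the abstract above builds on, here is a minimal software sketch of its data movement. It is a plain Python simulation under simplifying assumptions, not the authors' OpenCL design: each "processing element" of an n×n grid holds a single scalar rather than a sub-matrix, the two-level blocking described in the abstract is omitted, and the function name cannon_matmul is hypothetical.

```python
def cannon_matmul(A, B):
    """Multiply two n x n matrices with Cannon's shift-and-accumulate scheme.

    Each grid cell (i, j) stands in for one processing element that only
    ever sees its local a and b values and passes them to a neighbour.
    """
    n = len(A)
    # Initial skew: row i of A is rotated left by i positions,
    # column j of B is rotated up by j positions.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every cell multiplies its local pair and accumulates the product.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Rotate all rows of a left by one and all columns of b up by one.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


if __name__ == "__main__":
    print(cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The point of the scheme is that each cell communicates only with its immediate neighbours, which is the kind of regular, local data movement that tends to map well onto FPGA logic.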
Implementation of Distributed Processing Using a PC-FPGA Hybrid System
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00074
Keisuke Takano, Tetsuya Oda, Ryo Ozaki, A. Uejima, M. Kohata
Abstract: Recent research has focused on decreasing processing times and power consumption through the combined use of PCs and FPGAs. Here we describe a system that performs distributed processing across multiple PCs and FPGAs, using Ethernet for PC connections and an in-house designed FPGA network to connect FPGAs. For processing software, we used Apache Spark on the PCs and an in-house application programming interface (API) for the FPGAs. Experimental results show that both PCs and FPGAs are effectively used for image compression calculation.
Citations: 2
Extending the Lifetime of Coarse-Grained Runtime Reconfigurable FPGAs by Balancing Processing Element Usage
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00050
Bo Hu, M. Shihab, Y. Makris, Benjamin Carrión Schäfer, C. Sechen
Abstract: Contemporary coarse-grained runtime reconfigurable architectures (CGRRAs) are increasingly sensitive to aging effects such as Negative Bias Temperature Instability (NBTI). To address this, we propose a reliability-aware floorplanner for CGRRAs based on a mixed Linear Programming (LP) and Integer Linear Programming (ILP) method that extends the Mean Time to Failure (MTTF) of CGRRAs by balancing processing element (PE) usage. We use this as the basis of a design space explorer that generates a variety of configurations, trading off PE displacement vs. MTTF. On average, a 2.4× improvement in MTTF was obtained for an average critical path delay increase of under 2 percent (although most benchmarks had no delay increase) compared to the default lifetime unaware floorplan.
Citations: 0
FPNet: Customized Convolutional Neural Network for FPGA Platforms
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00077
Y. Yang, Chao Wang, Lei Gong, Xuehai Zhou
Abstract: The Convolutional Neural Network (CNN) has been widely adopted in various applications, including object detection, mobile robot vision, and image search engines. Due to the computing-intensive and memory-intensive nature of CNN models, specialized hardware accelerators, such as Application-Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs), have been widely utilized in edge devices. Among all the specialized neural network hardware accelerators, the FPGA accelerator stands out for its flexibility, short time-to-market, and energy efficiency. Previous works on FPGA accelerator design mostly change the hardware configuration to fit the CNN model structure. In this paper, we propose an automated neural network architecture search approach utilizing reinforcement learning to design customized convolutional neural network models for specialized FPGA platforms.
Citations: 8
Optimizing FPGA-Based Streaming Applications for Throughput Using Pipelining
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00065
Ali Asghar, R. V. Loo, Timon Kruiper, Daniel Ziener
Abstract: In this paper, we present an automated flow for insertion of pipeline stages in FPGA-based streaming applications in order to increase the throughput. The proposed approach involves the utilization of Xilinx's Automated Pipeline Analysis tool to estimate the number of pipeline stages, while the Rapid-Wright framework incorporates these stages into a synthesized design. The Vivado Design Suite is then used to place and route the modified netlist. Furthermore, a recycling approach has also been proposed to reduce excess registers. The results show a significant improvement in the maximum operating frequency for designs without any sequential loops (~51%) with a moderate resource overhead, while slight gains (~12%) were also observed for designs containing feedback loops.
Citations: 2
RNA: Reconfigurable LSTM Accelerator with Near Data Approximate Processing
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00055
Yu Gong, Bo Liu, Wei-qi Ge, Longxing Shi
Abstract: Near Data Processing (NDP) techniques are introduced into deep learning accelerators as they can greatly relieve the pressure on memory bandwidth. In addition, approximate computing is adopted in accelerating neural networks, exploiting network fault tolerance to reduce energy consumption. In this paper, an NDP accelerator with approximate computing features for LSTM is proposed to explore data parallelism with reconfigurable features. Firstly, a hybrid-grained network partitioning model with a scheduling strategy for LSTM is put forward to achieve high processing parallelism. Secondly, the approximate computing units are designed for LSTM with adaptive precision. Then the heterogeneous architecture, RNA, with reconfigurable computing arrays and approximate NDP units is proposed and implemented regarding the configuration code. The gates and cells in LSTM are modeled into fine-grained operations, organized in coarse-grained tasks, and then mapped onto RNA. In addition, approximate computing units are integrated into the NDP units with adaptive precision, which is also controlled by the configuration codes. The proposed RNA architecture achieves 544 GOPS/W energy efficiency while processing LSTM, and can be further extended to larger and more complex recurrent neural networks. Compared with the state-of-the-art accelerator for LSTM, it is 2.14 times better in efficiency.
Citations: 0
SpWMM: A High-Performance Sparse-Winograd Matrix-Matrix Multiplication Accelerator for CNNs
2019 International Conference on Field-Programmable Technology (ICFPT) Pub Date : 2019-12-01 DOI: 10.1109/ICFPT47387.2019.00041
Di Wu, Wei Cao, Lingli Wang
Abstract: In recent years, many CNN accelerators have been proposed to exploit the sparsity of networks and enjoy the benefits of both computation and memory reduction. However, these accelerators either cannot exploit the sparsity of both activations and weights, or cannot achieve stable performance with a static scheduling strategy, which is vulnerable to the sparsity distribution. This paper proposes a dynamic scheduling strategy and a balanced compressed sparse row (BCSR) format to efficiently address these two issues. A set-associative structure is presented to trade off load balance against logic overhead. We propose SpWMM to accelerate CNN inference, which is the first work to implement both sparse Winograd convolution and sparse fully-connected (FC) layers. On contemporary neural networks, this work achieves: (1) 2.6 Top/s for Winograd convolution and 525 Gop/s for 1×1 convolution and FC layers in the 4-way association design on the Xilinx ZC706 platform, (2) 6.5 Top/s for Winograd convolution and 1.2 Top/s for 1×1 convolution and FC layers in the 16-way association design on the Xilinx VCU1525 platform. Compared with the state-of-the-art works on the same platform, the 4-way design achieves a 2.0× speedup.
Citations: 3