Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Articles

Automatic Optimising CNN with Depthwise Separable Convolution on FPGA: (Abstact Only)
Ruizhe Zhao, Xinyu Niu, W. Luk
{"title":"Automatic Optimising CNN with Depthwise Separable Convolution on FPGA: (Abstact Only)","authors":"Ruizhe Zhao, Xinyu Niu, W. Luk","doi":"10.1145/3174243.3174959","DOIUrl":"https://doi.org/10.1145/3174243.3174959","url":null,"abstract":"Convolution layers in Convolutional Neural Networks (CNNs) are effective in vision feature extraction but quite inefficient in computational resource usage. Depthwise separable convolution layer has been proposed in recent publications to enhance the efficiency without reducing the effectiveness by separately computing the spatial and cross-channel correlations from input images and has proven successful in state-of-the-art networks such as MobileNets [1] and Xception [2]. Based on the facts that depthwise separable convolution is highly structured and uses limited resources, we argue that it can well fit reconfigurable platforms like FPGA. To benefit FPGA platforms with this new layer, in this paper, we present a novel framework that can automatically generate and optimise hardware designs for depthwise separable CNNs. Besides, in our framework, existing conventional CNNs can be systematically converted to ones whose standard convolution layers are selectively replaced with functionally identical depthwise separable convolution layers, by carefully balancing the trade-off among speed, accuracy, and resource usage through resource usage modelling and network fine-tuning. 
Results show that hardware designs generated by our framework reach up to 231.7 frames per second for MobileNets; for VGG-16 [3], replacing layers yields a 3.43x speed-up with a 3.54% accuracy decrease on the ImageNet [4] dataset, comparing the original model with the layer-replaced one.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127688616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
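The efficiency argument in the abstract above can be illustrated with the commonly used multiply-accumulate (MAC) count for a depthwise separable layer. This is a minimal sketch of that standard cost model, not the resource model used by the authors; the function names and the example layer shape are assumptions chosen for illustration.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """MACs for a standard k x k convolution producing an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """MACs for a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution mixing channels
    return depthwise + pointwise


def cost_ratio(k, c_in, c_out, h, w):
    """Separable cost divided by standard cost; equals 1/c_out + 1/k**2."""
    return depthwise_separable_macs(k, c_in, c_out, h, w) / standard_conv_macs(
        k, c_in, c_out, h, w
    )


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 output.
    print(cost_ratio(3, 64, 128, 56, 56))
```

For 3 x 3 kernels the ratio approaches 1/9 as the number of output channels grows, which is why replacing layers can give a speed-up of several times.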
Citations: 20
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Junzhong Shen, Y. Huang, Zelong Wang, Yuran Qiao, M. Wen, Chunyuan Zhang
{"title":"Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA","authors":"Junzhong Shen, Y. Huang, Zelong Wang, Yuran Qiao, M. Wen, Chunyuan Zhang","doi":"10.1145/3174243.3174257","DOIUrl":"https://doi.org/10.1145/3174243.3174257","url":null,"abstract":"Three-dimensional convolutional neural networks (3D CNNs) are used efficiently in many computer vision applications. Most previous work in this area has concentrated only on designing and optimizing accelerators for 2D CNNs, with few attempts made to accelerate 3D CNNs on FPGA. We find accelerating 3D CNNs on FPGA to be challenging due to their high computational complexity and storage demands. More importantly, although the computation patterns of 2D and 3D CNNs are analogous, the conventional approaches adopted for accelerating 2D CNNs may be unfit for 3D CNN acceleration. In this paper, in order to accelerate 2D and 3D CNNs using a uniform framework, we propose a uniform template-based architecture that uses templates based on the Winograd algorithm to ensure fast development of 2D and 3D CNN accelerators. Furthermore, we also develop a uniform analytical model to facilitate efficient design space explorations of 2D and 3D CNN accelerators based on our architecture. Finally, we demonstrate the effectiveness of the template-based architecture by implementing accelerators for real-life 2D and 3D CNNs (VGG16 and C3D) on multiple FPGA platforms. On S2C VUS440, we achieve up to 1.13 TOPS and 1.11 TOPS under low resource utilization for VGG16 and C3D, respectively. 
End-to-end comparisons with CPU and GPU solutions demonstrate that our implementation of C3D achieves gains of up to 13x and 60x in performance and energy relative to a CPU solution, and a 6.4x energy efficiency gain over a GPU solution.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127727528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
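The Winograd algorithm mentioned in the abstract above trades multiplications for additions. The sketch below shows only the well-known 1D building block F(2,3), which computes two outputs of a 3-tap filter with four multiplications instead of six; it is not the paper's 2D/3D template architecture, and the function names are assumptions.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter-side terms; in practice these are precomputed once per filter.
    ga = (g0 + g1 + g2) / 2
    gb = (g0 - g1 + g2) / 2
    # The four element-wise products.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * ga
    m3 = (d2 - d1) * gb
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference result: the same two outputs computed with 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


if __name__ == "__main__":
    data, taps = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -2.0]
    print(winograd_f23(data, taps), direct_f23(data, taps))
```

Nesting this transform along each spatial axis gives the 2D and 3D variants, which is one reason a shared template for both cases is plausible.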
Citations: 80
Solving Satisfiability Problem on Quantum Annealer: A Lesson from FPGA CAD Tools: (Abstract Only)
J. Su, Lei He
{"title":"Solving Satisfiability Problem on Quantum Annealer: A Lesson from FPGA CAD Tools: (Abstract Only)","authors":"J. Su, Lei He","doi":"10.1145/3174243.3174972","DOIUrl":"https://doi.org/10.1145/3174243.3174972","url":null,"abstract":"Recently, a practical quantum annealing device has been commercialized by D-Wave Systems, sparking research interest in developing applications to solve problems that are intractable for classical computers. This paper provides a tutorial on using a quantum annealer to solve the Boolean satisfiability problem. We explain the computational model of a quantum annealer and discuss a detailed mapping technique inspired by the FPGA CAD flow, including stages such as logic optimization, placement, and routing.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134441833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Scalable Approach to Exact Resource-Constrained Scheduling Based on a Joint SDC and SAT Formulation
Steve Dai, Gai Liu, Zhiru Zhang
{"title":"A Scalable Approach to Exact Resource-Constrained Scheduling Based on a Joint SDC and SAT Formulation","authors":"Steve Dai, Gai Liu, Zhiru Zhang","doi":"10.1145/3174243.3174268","DOIUrl":"https://doi.org/10.1145/3174243.3174268","url":null,"abstract":"Despite increasing adoption of high-level synthesis (HLS) for its design productivity advantage, success in achieving high quality-of-results out-of-the-box is often hindered by the inexactness of the common HLS optimizations. In particular, while scheduling forms the algorithmic core to HLS technology, current scheduling algorithms rely heavily on fundamentally inexact heuristics that make ad hoc local decisions and cannot accurately and globally optimize over a rich set of constraints. To tackle this challenge, we propose a scheduling formulation based on system of integer difference constraints (SDC) and Boolean satisfiability (SAT) to exactly handle a variety of scheduling constraints. We develop a specialized scheduler based on conflict-driven learning and problem-specific knowledge to optimally and efficiently solve the resource-constrained scheduling problem. By leveraging the efficiency of SDC algorithms and scalability of modern SAT solvers, our scheduling technique is able to achieve on average over 100x improvement in runtime over the integer linear programming (ILP) approach while attaining optimal latency. 
By integrating our scheduling formulation into a state-of-the-art open-source HLS tool, we further demonstrate the applicability of our scheduling technique with a suite of representative benchmarks targeting FPGAs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132925878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
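A system of integer difference constraints (SDC), as named in the abstract above, consists of inequalities of the form x[u] - x[v] <= c, and its feasibility can be checked with Bellman-Ford relaxation on the constraint graph. The sketch below covers only this SDC part for timing constraints; the SAT side that handles resource constraints, and everything specific to the authors' scheduler, is omitted. The function name `solve_sdc` and the example constraints are assumptions.

```python
def solve_sdc(num_vars, constraints):
    """Solve constraints (u, v, c), each meaning x[u] - x[v] <= c.

    Returns a list of non-negative integers satisfying every constraint,
    or None if the system is infeasible (negative cycle in the graph).
    """
    # Starting every distance at 0 models a virtual source with a
    # zero-weight edge to each variable.
    dist = [0] * num_vars
    for _ in range(num_vars):
        changed = False
        for u, v, c in constraints:
            if dist[v] + c < dist[u]:
                dist[u] = dist[v] + c
                changed = True
        if not changed:
            base = min(dist)
            return [d - base for d in dist]  # shift so the earliest step is 0
    return None  # still relaxing after num_vars passes: negative cycle


if __name__ == "__main__":
    # Hypothetical chain: op1 starts >= 1 cycle after op0, op2 >= 2 after op1.
    deps = [(0, 1, -1), (1, 2, -2)]
    print(solve_sdc(3, deps))
    # Adding "op2 starts at most 2 cycles after op0" makes the system infeasible.
    print(solve_sdc(3, deps + [(2, 0, 2)]))
```

Dependency and latency constraints fit this form directly; resource limits do not, which is where the abstract's joint formulation with SAT comes in.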
Citations: 14
FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort (pHS5)(Abstract Only)
Mikhail Asiatici, Damian Maiorano, P. Ienne
{"title":"FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort (pHS5)(Abstract Only)","authors":"Mikhail Asiatici, Damian Maiorano, P. Ienne","doi":"10.1145/3174243.3174993","DOIUrl":"https://doi.org/10.1145/3174243.3174993","url":null,"abstract":"String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging and it is no surprise that no string sorters on FPGA have been proposed yet. We present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS5 is based on the state-of-the-art string sorting algorithm for multi-core shared memory CPUs, pS5, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS5 by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS5 to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. 
We accelerate the whole algorithm by up to 10% compared to the 28 thread software baseline running on the 14-core Xeon processor and by up to 36% at lower thread counts.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127114012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
H. Zohouri, Artur Podobas, S. Matsuoka
{"title":"Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL","authors":"H. Zohouri, Artur Podobas, S. Matsuoka","doi":"10.1145/3174243.3174248","DOIUrl":"https://doi.org/10.1145/3174243.3174248","url":null,"abstract":"Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without having such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arising from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance, for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. 
Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122156146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 80
P4-Compatible High-Level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs
Jeferson Santiago da Silva, F. Boyer, J. Langlois
{"title":"P4-Compatible High-Level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs","authors":"Jeferson Santiago da Silva, F. Boyer, J. Langlois","doi":"10.1145/3174243.3174270","DOIUrl":"https://doi.org/10.1145/3174243.3174270","url":null,"abstract":"Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture for arbitrary packet parsing to be used in SDN networks. We generate low latency and high-speed streaming packet parsers directly from a packet processing program. Our architecture is pipelined and entirely modeled using templated C++ classes. The pipeline layout is derived from a parser graph that corresponds to a P4 code after a series of graph transformation rounds. The RTL code is generated from the C++ description using Xilinx Vivado HLS and synthesized with Xilinx Vivado. Our architecture achieves a 100 Gb/s data rate in a Xilinx Virtex-7 FPGA while reducing the latency by 45% and the LUT usage by 40% compared to the state-of-the-art.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124098410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
{"title":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","authors":"","doi":"10.1145/3174243","DOIUrl":"https://doi.org/10.1145/3174243","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132891872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0