Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Articles (Page 7)

Continuous Skyline Computation Accelerator with Parallelizing Dominance Relation Calculations: (Abstract Only)

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-15 DOI: 10.1145/3174243.3174961

Kenichi Koizumi, K. Hiraki, M. Inaba

Abstract: Skyline computation is a method for extracting interesting entries from a large population with multiple attributes. These entries, called skyline or Pareto-optimal entries, are known to have extreme characteristics that cannot be found by using outlier detection methods. Skyline computation is an important task for characterizing large amounts of data and selecting interesting entries with extreme features. When the population changes dynamically, the task of calculating a sequence of skyline sets is called continuous skyline computation. This task is known to be difficult for the following reasons: (1) information must be kept for non-skyline entries, since they may join the skyline in the future; (2) the appearance or disappearance of even a single entry can change the skyline drastically; and (3) it is difficult to adopt a geometric acceleration algorithm for skyline computation tasks with high-dimensional datasets. A new algorithm, called jointed rooted-tree (JR-tree), has been developed that manages entries using a rooted-tree structure. JR-tree delays extending the tree to deeper levels to accelerate tree construction and traversal. In this study, we propose a JR-tree-based continuous skyline computation acceleration algorithm. Our hardware algorithm parallelizes the calculation of dominance relations between a target entry and the skyline entries. We implemented our hardware algorithm on an FPGA and showed that high-speed tree construction and traversal can be realized. Compared with an Intel CPU running state-of-the-art software algorithms, our FPGA-based implementation was found to reduce the query processing time for synthetic and real-world datasets. Our hardware implementation is 1.7x to 35x faster than the software implementations.
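The dominance relation that the abstract refers to can be illustrated with a minimal software sketch. This is not the JR-tree algorithm or the authors' hardware design; it is a plain list-based skyline filter, and the function names `dominates` and `skyline`, as well as the "smaller is better" convention for every attribute, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b in every attribute and
    strictly better in at least one (assumption: smaller is better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(entries: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Compute the skyline by testing each entry against the current skyline."""
    result: List[Tuple[float, ...]] = []
    for entry in entries:
        # Each of these dominance tests is independent of the others.
        if any(dominates(s, entry) for s in result):
            continue  # entry is dominated, so it is not a skyline entry
        # Drop skyline entries that the new entry dominates, then add it.
        result = [s for s in result if not dominates(entry, s)]
        result.append(entry)
    return result


points = [(1, 9), (3, 3), (2, 8), (5, 5), (9, 1), (4, 4)]
print(skyline(points))
```

The tests of one target entry against all current skyline entries do not depend on each other, which is what makes this step a natural candidate for parallel evaluation.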

Citations: 0

Automatic Optimising CNN with Depthwise Separable Convolution on FPGA: (Abstract Only)

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-15 DOI: 10.1145/3174243.3174959

Ruizhe Zhao, Xinyu Niu, W. Luk

Abstract: Convolution layers in Convolutional Neural Networks (CNNs) are effective in vision feature extraction but quite inefficient in computational resource usage. The depthwise separable convolution layer has been proposed in recent publications to enhance efficiency without reducing effectiveness by separately computing the spatial and cross-channel correlations from input images, and it has proven successful in state-of-the-art networks such as MobileNets [1] and Xception [2]. Because depthwise separable convolution is highly structured and uses limited resources, we argue that it fits reconfigurable platforms such as FPGAs well. To bring this new layer to FPGA platforms, we present a novel framework that can automatically generate and optimise hardware designs for depthwise separable CNNs. In addition, our framework can systematically convert existing conventional CNNs into ones whose standard convolution layers are selectively replaced with functionally identical depthwise separable convolution layers, carefully balancing the trade-off among speed, accuracy, and resource usage through resource usage modelling and network fine-tuning. Results show that hardware designs generated by our framework can reach up to 231.7 frames per second for MobileNets, and for VGG-16 [3] we obtain a 3.43 times speed-up with a 3.54% accuracy decrease on the ImageNet [4] dataset when comparing the original model with a layer-replaced one.
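To make the efficiency argument concrete, the following sketch compares the multiply-accumulate (MAC) counts of a standard convolution and a depthwise separable one. It is a back-of-the-envelope cost model, not the authors' resource model; the function names and the example layer shape are hypothetical, and stride 1 with an h x w output is assumed.

```python
def standard_conv_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """MACs for a k x k standard convolution producing an h x w x c_out output."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """MACs for a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 output.
std = standard_conv_macs(3, 64, 128, 56, 56)
sep = depthwise_separable_macs(3, 64, 128, 56, 56)
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.4f}")
```

Under these assumptions the ratio of the two costs simplifies to 1/c_out + 1/k^2, so for 3 x 3 kernels the separable form needs roughly eight to nine times fewer MACs.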

Citations: 20

Solving Satisfiability Problem on Quantum Annealer: A Lesson from FPGA CAD Tools: (Abstract Only)

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-15 DOI: 10.1145/3174243.3174972

J. Su, Lei He

Citations: 0

A Scalable Approach to Exact Resource-Constrained Scheduling Based on a Joint SDC and SAT Formulation

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-15 DOI: 10.1145/3174243.3174268

Steve Dai, Gai Liu, Zhiru Zhang

Abstract: Despite increasing adoption of high-level synthesis (HLS) for its design productivity advantage, success in achieving high quality-of-results out-of-the-box is often hindered by the inexactness of the common HLS optimizations. In particular, while scheduling forms the algorithmic core of HLS technology, current scheduling algorithms rely heavily on fundamentally inexact heuristics that make ad hoc local decisions and cannot accurately and globally optimize over a rich set of constraints. To tackle this challenge, we propose a scheduling formulation based on a system of integer difference constraints (SDC) and Boolean satisfiability (SAT) to exactly handle a variety of scheduling constraints. We develop a specialized scheduler based on conflict-driven learning and problem-specific knowledge to optimally and efficiently solve the resource-constrained scheduling problem. By leveraging the efficiency of SDC algorithms and the scalability of modern SAT solvers, our scheduling technique achieves on average over 100x improvement in runtime over the integer linear programming (ILP) approach while attaining optimal latency. By integrating our scheduling formulation into a state-of-the-art open-source HLS tool, we further demonstrate the applicability of our scheduling technique with a suite of representative benchmarks targeting FPGAs.
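As background for the SDC half of the formulation, a system of difference constraints of the form x[u] - x[v] <= c can be checked for feasibility with Bellman-Ford relaxation. The sketch below shows only that textbook step; it does not model resource constraints, the SAT encoding, or the conflict-driven scheduler described in the abstract. The function name `solve_sdc` and the three-operation example are hypothetical.

```python
from typing import List, Optional, Tuple


def solve_sdc(num_vars: int,
              constraints: List[Tuple[int, int, int]]) -> Optional[List[int]]:
    """Each constraint (u, v, c) means x[u] - x[v] <= c.

    Returns one feasible integer assignment, or None if the system is
    infeasible (the constraint graph contains a negative cycle).
    """
    # Starting at 0 is equivalent to a virtual source with a zero-weight
    # edge to every variable.
    dist = [0] * num_vars
    for _ in range(num_vars + 1):
        changed = False
        for u, v, c in constraints:
            if dist[v] + c < dist[u]:
                dist[u] = dist[v] + c
                changed = True
        if not changed:
            return dist
    return None  # still relaxing after num_vars + 1 rounds: negative cycle


# Hypothetical dependences: op1 starts >= 2 cycles after op0,
# op2 starts >= 1 cycle after op1, and op2 starts <= 4 cycles after op0.
feasible = solve_sdc(3, [(0, 1, -2), (1, 2, -1), (2, 0, 4)])
# Shift so the earliest operation starts at cycle 0.
schedule = [t - min(feasible) for t in feasible]
print(schedule)

# Tightening the last bound to 2 cycles contradicts the first two constraints.
infeasible = solve_sdc(3, [(0, 1, -2), (1, 2, -1), (2, 0, 2)])
print(infeasible)
```

Dependence and timing constraints fit this form directly, whereas resource limits do not, which is consistent with the abstract pairing SDC with a SAT solver.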

Citations: 14

FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort (pHS5) (Abstract Only)

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-15 DOI: 10.1145/3174243.3174993

Mikhail Asiatici, Damian Maiorano, P. Ienne

Abstract: String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging, and it is no surprise that no string sorters on FPGA have been proposed yet. We present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS5 is based on the state-of-the-art string sorting algorithm for multi-core shared-memory CPUs, pS5, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS5 by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS5 to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. We accelerate the whole algorithm by up to 10% compared to the 28-thread software baseline running on the 14-core Xeon processor and by up to 36% at lower thread counts.
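For readers unfamiliar with sample sort, the general idea behind the algorithm family named in the title can be sketched as follows: pick splitters from a random sample, classify every string into a bucket, and sort the buckets independently. This is a single-level, sequential illustration only; it is not pS5 or pHS5, and it does not claim to show which kernel the authors accelerated. The function name, bucket count, and sample size are assumptions.

```python
import bisect
import random
from typing import List


def sample_sort(strings: List[str], num_buckets: int = 4, seed: int = 0) -> List[str]:
    """One-level sample sort: classify by splitters, then sort each bucket."""
    if len(strings) < 2 * num_buckets:
        return sorted(strings)  # too small to be worth splitting

    # Draw a random sample and take evenly spaced elements as splitters.
    rng = random.Random(seed)
    sample = sorted(rng.sample(strings, min(len(strings), 4 * num_buckets)))
    step = len(sample) // num_buckets
    splitters = [sample[i * step] for i in range(1, num_buckets)]

    # Classification: each string is placed independently of the others.
    buckets: List[List[str]] = [[] for _ in range(num_buckets)]
    for s in strings:
        buckets[bisect.bisect_right(splitters, s)].append(s)

    # Buckets cover disjoint, ordered key ranges, so they can be sorted
    # separately and concatenated.
    result: List[str] = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result


words = ["pear", "fig", "apple", "kiwi", "plum", "date", "lime", "grape",
         "melon", "peach", "cherry", "mango", "lemon", "guava", "olive"]
print(sample_sort(words))
```

Because both the classification of each string and the sorting of each bucket are independent tasks, this structure lends itself to being split across several workers.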

Citations: 0

Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-01 DOI: 10.1145/3174243.3174248

H. Zohouri, Artur Podobas, S. Matsuoka

Abstract: Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arising from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance for 2D and 3D stencils, respectively, which rivals the performance of a highly optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.
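The combination of spatial and temporal blocking can be illustrated on a one-dimensional, three-point stencil. In this sketch the grid is cut into blocks (spatial blocking), and each block is loaded with a halo wide enough to apply several time steps before writing back (temporal blocking). It is a software illustration under stated assumptions, not the authors' OpenCL design: the stencil, the fixed boundary cells, the function names, and the requirement of at least two grid cells are all choices made for this example.

```python
from typing import List


def step(u: List[float]) -> List[float]:
    """One 3-point averaging sweep; the two end cells are held fixed."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def reference(u: List[float], t: int) -> List[float]:
    """Apply t sweeps over the whole grid, one time step at a time."""
    for _ in range(t):
        u = step(u)
    return u


def blocked(u: List[float], t: int, block: int) -> List[float]:
    """Apply t sweeps block by block, fusing all t time steps per block."""
    n = len(u)
    out = list(u)
    for start in range(0, n, block):
        end = min(start + block, n)
        # Halo of t cells per side: stencil radius 1 times t fused steps.
        lo = max(0, start - t)
        hi = min(n, end + t)
        tile = u[lo:hi]
        for _ in range(t):
            tile = step(tile)
        # Only the block interior is valid; halo cells are discarded.
        out[start:end] = tile[start - lo:end - lo]
    return out


grid = [float(i * i % 7) for i in range(40)]
same = all(abs(a - b) < 1e-12
           for a, b in zip(reference(grid, 3), blocked(grid, 3, 8)))
print(same)
```

The halo cells are recomputed by neighbouring blocks, so larger temporal depth trades redundant computation for fewer passes over external memory; balancing that trade-off is the kind of tuning a performance model can guide.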

Citations: 80

P4-Compatible High-Level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-11-17 DOI: 10.1145/3174243.3174270

Jeferson Santiago da Silva, F. Boyer, J. Langlois

Citations: 31

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 1900-01-01 DOI: 10.1145/3174243

Citations: 0