{"title":"Performance Comparison of Multiple Approaches of Status Register for Medium Density Memory Suitable for Implementation of a Lossless Compression Dictionary: (Abstract Only)","authors":"Matěj Bartík, S. Ubik, P. Kubalík, Tomás Benes","doi":"10.1145/3174243.3174976","DOIUrl":"https://doi.org/10.1145/3174243.3174976","url":null,"abstract":"This paper presents a performance comparison of various approaches of realization of status register suitable for maintaining (in)valid bits in mid-density memory structures implemented in Xilinx FPGAs. An example of a such structure with status register could be a dictionary for Lempel-Ziv based lossless compression algorithms where the dictionary has to be initialized before each run of the algorithm with minimum time and logic resources consumption. The performance evaluation of designs has been made in Xilinx ISE and Vivado toolkits for the Virtex-7 FPGA. This research has been partially supported by the CTU project SGS17/017/OHK3/1T/18 \"Dependable and attack-resistant architectures for programmable devices\" and by the project \"E-infrastructure CESNET \"modernization\" no. CZ.02.1.01/0.0/0.0/16 013/0001797.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131548720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Comparison of Multiples and Target Detection with Imager-driven Processing Mode for Ultrafast-Imager: (Abstract Only)","authors":"Xiaoyu Yu, D. Ye","doi":"10.1145/3174243.3174990","DOIUrl":"https://doi.org/10.1145/3174243.3174990","url":null,"abstract":"Latest vision tasks trend to be the real-time processing with high throughput frame rate and low latency. High spatiotemporal resolution imagers continue to spring up but only a few of them can be used in real applications owing to the excessive computational burden and lacking of suitable architecture. This paper presents a solution for target detection task in imager-driven processing mode (IMP), which takes shorter time in processing than the time gap between frames, even if the ulreafast imager run at full frame rate. High throughput pixel stream outputted from imager is analyzed base on multi features in a fully pipelined and bufferless architecture in FPGA. A pyramid shape model consisting of 2-D Processing Element (PE) array is proposed to search the connected regions of target candidates distributed at different time slices, and extract corresponding features when the stream pass through. A Label based 1-D PE Array collects the feature flow generated by the pyramid according to their labels, and output the feature vector of each target candidate in real time. The proposed model has been tested in simulation and experiments for target detection with 0.8Gpixel/sec (2320×1726 with 192FPS) data stream input, and the latency is less than 1 microsecond.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132836845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Performance Differences of FPGAs and GPUs: (Abtract Only)","authors":"J. Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, Shaochong Zhang","doi":"10.1145/3174243.3174970","DOIUrl":"https://doi.org/10.1145/3174243.3174970","url":null,"abstract":"The notorious power wall has significantly limited the scaling for general-purpose processors. To address this issue, various accelerators, such as GPUs and FPGAs, emerged to achieve better performance and energy-efficiency. Between these two programmable accelerators, a natural question arises: which applications are better suited for FPGAs, which for GPUs, and why? In this paper, our goal is to better understand the performance differences between FPGAs and GPUs and provide more insights to the community. We intentionally start with a widely used GPU-friendly benchmark suite Rodinia, and port 11 of the benchmarks (15 kernels) onto FPGAs using the more portable and programmable high-level synthesis C. We provide a simple five-step strategy for FPGA accelerator designs that can be easily understood and mastered by software programmers, and present a quantitative performance breakdown of each step. Then we propose a set of performance metrics, including normalized operations per cycle (OPC_norm) for each pipeline, and effective parallel factor (effective_para_factor), to compare the performance of GPU and FPGA accelerator designs. We find that for 6 out of the 15 kernels, today's FPGAs can provide comparable performance or even achieve better performance, while only consume about 1/10 of GPUs' power (both on the same technology node). We observe that FPGAs usually have higher OPC_norm in most kernels in light of their customized deep pipeline but lower effective_para_factor due to far lower memory bandwidth than GPUs. Future FPGAs should increase their off-chip bandwidth and clock frequency to catch up with GPUs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131663016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Special Session: Deep Learning","authors":"A. Ling","doi":"10.1145/3252935","DOIUrl":"https://doi.org/10.1145/3252935","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114840298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator","authors":"Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, T. Delbrück","doi":"10.1145/3174243.3174261","DOIUrl":"https://doi.org/10.1145/3174243.3174261","url":null,"abstract":"Recurrent Neural Networks (RNNs) are widely used in speech recognition and natural language processing applications because of their capability to process temporal sequences. Because RNNs are fully connected, they require a large number of weight memory accesses, leading to high power consumption. Recent theory has shown that an RNN delta network update approach can reduce memory access and computes with negligible accuracy loss. This paper describes the implementation of this theoretical approach in a hardware accelerator called \"DeltaRNN\" (DRNN). The DRNN updates the output of a neuron only when the neuron»s activation changes by more than a delta threshold. It was implemented on a Xilinx Zynq-7100 FPGA. FPGA measurement results from a single-layer RNN of 256 Gated Recurrent Unit (GRU) neurons show that the DRNN achieves 1.2 TOp/s effective throughput and 164 GOp/s/W power efficiency. The delta update leads to a 5.7x speedup compared to a conventional RNN update because of the sparsity created by the DN algorithm and the zero-skipping ability of DRNN.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126913804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving FPGA Performance with a S44 LUT Structure","authors":"Wenyi Feng, J. Greene, A. Mishchenko","doi":"10.1145/3174243.3174272","DOIUrl":"https://doi.org/10.1145/3174243.3174272","url":null,"abstract":"FPGA performance depends in part on the choice of basic logic cell. Previous work dating back to 1999-2005 found that the best look-up table (LUT) sizes for area-delay product are 4-6, with 4 better for area and 6 for performance. Since that time several things have changed. A new 'LUT structure' mapping technique can target cells with a larger number of inputs (cut size) without assuming that the cell implements all possible functions of those inputs. We consider in particular a 7-input function composed of two tightly-coupled 4-input LUTs. Changes in process technology have increased the relative importance of wiring delay and configuration memory area. Finally, modern benchmark applications include carry chains, math and memory blocks. Due to these changes, we show that mapping to a 7-input LUT structure can approach the performance of 6-input LUTs while retaining the area and static power advantage of 4-input LUTs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127930833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 7: Circuits and Computation Engines","authors":"Nachiket Kapre","doi":"10.1145/3252942","DOIUrl":"https://doi.org/10.1145/3252942","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133209284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically Scheduled High-level Synthesis","authors":"Lana Josipović, Radhika Ghosal, P. Ienne","doi":"10.1145/3174243.3174264","DOIUrl":"https://doi.org/10.1145/3174243.3174264","url":null,"abstract":"High-level synthesis (HLS) tools almost universally generate statically scheduled datapaths. Static scheduling implies that circuits out of HLS tools have a hard time exploiting parallelism in code with potential memory dependencies, with control-dependent dependencies in inner loops, or where performance is limited by long latency control decisions. The situation is essentially the same as in computer architecture between Very-Long Instruction Word (VLIW) processors and dynamically scheduled superscalar processors; the former display the best performance per cost in highly regular embedded applications, but general purpose, irregular, and control-dominated computing tasks require the runtime flexibility of dynamic scheduling. In this work, we show that high-level synthesis of dynamically scheduled circuits is perfectly feasible by describing the implementation of a prototype synthesizer which generates a particular form of latency-insensitive synchronous circuits. Compared to a commercial HLS tool, the result is a different trade-off between performance and circuit complexity, much as superscalar processors represent a different trade-off compared to VLIW processors: in demanding applications, the performance is very significantly improved at an affordable cost. We here demonstrate only the first steps towards more performant high-level synthesis tools adapted to emerging FPGA applications and the demands of computing in broader application domains.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130115831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs","authors":"Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, Yun Liang","doi":"10.1145/3174243.3174253","DOIUrl":"https://doi.org/10.1145/3174243.3174253","url":null,"abstract":"Recently, significant accuracy improvement has been achieved for acoustic recognition systems by increasing the model size of Long Short-Term Memory (LSTM) networks. Unfortunately, the ever-increasing size of LSTM model leads to inefficient designs on FPGAs due to the limited on-chip resources. The previous work proposes to use a pruning based compression technique to reduce the model size and thus speedups the inference on FPGAs. However, the random nature of the pruning technique transforms the dense matrices of the model to highly unstructured sparse ones, which leads to unbalanced computation and irregular memory accesses and thus hurts the overall performance and energy efficiency. In contrast, we propose to use a structured compression technique which could not only reduce the LSTM model size but also eliminate the irregularities of computation and memory accesses. This approach employs block-circulant instead of sparse matrices to compress weight matrices and reduces the storage requirement from $mathcalO (k^2)$ to $mathcalO (k)$. Fast Fourier Transform algorithm is utilized to further accelerate the inference by reducing the computational complexity from $mathcalO (k^2)$ to $mathcalO (ktextlog k)$. The datapath and activation functions are quantized as 16-bit to improve the resource utilization. More importantly, we propose a comprehensive framework called C-LSTM to automatically optimize and implement a wide range of LSTM variants on FPGAs. According to the experimental results, C-LSTM achieves up to 18.8X and 33.5X gains for performance and energy efficiency compared with the state-of-the-art LSTM implementation under the same experimental setup, and the accuracy degradation is very small.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131944511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADAM: Automated Design Analysis and Merging for Speeding up FPGA Development","authors":"Ho-Cheung Ng, Shuanglong Liu, W. Luk","doi":"10.1145/3174243.3174247","DOIUrl":"https://doi.org/10.1145/3174243.3174247","url":null,"abstract":"This paper introduces ADAM, an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs, especially during the development process. ADAM has three key elements. First, a novel approximate maximum common subgraph detection algorithm with linear time complexity to maximize sharing of resources in the merged design. Second, a prototype tool implementing this common subgraph detection algorithm for dataflow graphs derived from Verilog designs; this tool would also generate the appropriate control circuits to enable selection of the original designs at runtime. Third, a comprehensive analysis of compilation time versus degree of similarity to identify the optimized user parameters for the proposed approach. Experimental results show that ADAM can reduce compilation time by around 5 times when each design is 95% similar to the others, and the compilation time is reduced from 1 hour to 10 minutes in the case of binomial filters.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133840358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}