Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Papers

Don't Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021731

S. Yazdanshenas, K. Tatsumura, Vaughn Betz

Citations: 27

Synchronization Constraints for Interconnect Synthesis

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021729

A. Rodionov, Jonathan Rose

Citations: 3

An Energy-Efficient Design-Time Scheduler for FPGAs Leveraging Dynamic Frequency Scaling Emulation (Abstract Only)

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021805

W. Loke, Chin Yang Koay

Citations: 0

Energy Efficient Scientific Computing on FPGAs using OpenCL

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021730

Dennis D. Weller, Fabian Oboril, D. Lukarski, J. Becker, M. Tahoori

Abstract: Scientific computing is an indispensable part of modern life, used in large-scale high-performance systems as well as in low-power smart cyber-physical systems. Hence, accelerators for scientific computing need to be fast and energy efficient. Partial differential equations (PDEs), an integral component of many scientific computing tasks, therefore require efficient implementation. In this regard, FPGAs are well suited for data-parallel computations as they occur in PDE solvers. However, including FPGAs in the programming flow is not trivial, as hardware description languages (HDLs) have to be used, which requires detailed knowledge of the underlying hardware. This issue is tackled by OpenCL, which allows developers to write standardized code in a C-like fashion, rendering experience with HDLs unnecessary. Yet, hiding the underlying hardware from the developer makes it challenging to implement solvers that exploit the full FPGA potential. Therefore, we propose in this work a comprehensive set of generic and specific optimization techniques for PDE solvers using OpenCL that improve the FPGA performance and energy efficiency by orders of magnitude. Based on these optimizations, our study shows that, despite the high abstraction level of OpenCL, very energy efficient PDE accelerators on the FPGA fabric can be designed, making the FPGA an ideal solution for power-constrained applications.

Citations: 43

Session details: Interconnect and Routing

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3257185

S. Kaptanoglu

Citations: 0

FPGA-Accelerated Transactional Execution of Graph Workloads

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3020078.3021743

Xiaoyu Ma, Dan Zhang, Derek Chiou

Abstract: Many applications that operate on large graphs can be intuitively parallelized by executing a large number of the graph operations concurrently and as transactions to deal with potential conflicts. However, large numbers of operations occurring concurrently might incur too many conflicts that would negate the potential benefits of the parallelization, which has probably made highly multi-threaded transactional machines seem impractical. Given the large size and topology of many modern graphs, however, such machines can provide real performance, energy efficiency, and programmability benefits. This paper describes an architecture that consists of many lightweight multi-threaded processing engines, a global transactional shared memory, and a work scheduler. We present challenges of realizing such an architecture, especially the requirement of scalable conflict detection, and propose solutions. We also argue that despite increased transaction conflicts due to the higher concurrency and single-thread latency, scalable speedup over serial execution can be achieved. We implement the proposed architecture as a synthesizable FPGA RTL design and demonstrate improved per-socket performance (2X) and energy efficiency (22X) by comparing to a baseline platform that contains two Intel Haswell processors, each with 12 cores.

Citations: 27

Session details: Applications

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI: 10.1145/3257192

M. Leeser

Citations: 0

A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks (Abstract Only)

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-20 DOI: 10.1145/3020078.3021786

Yixing Li, Zichuan Liu, Kai Xu, Hao Yu, Fengbo Ren

Abstract: FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency than GPUs. However, it has been a challenge for FPGA-based solutions to achieve a higher throughput than GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline (temporal parallelism) stages. Experimental results show that the proposed architecture running at 90 MHz on a Xilinx Virtex-7 FPGA achieves a computing throughput of 7.663 TOPS with a power consumption of 8.2 W regardless of the batch size of input data. This is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests (in small batch size). For processing static data (in large batch size), the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
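
The "bitwise convolution" mentioned in this abstract can be illustrated with the XNOR-and-popcount formulation commonly used for binarized networks. The sketch below is an assumption for illustration only: the listing does not describe the accelerator's actual datapath, and the function names (`pack_bits`, `binary_dot`, `binary_conv1d`) are hypothetical. It encodes +1 as bit 1 and -1 as bit 0, so a dot product becomes "matching bits minus mismatching bits".

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, w_bits, n):
    """Dot product of two length-n {+1, -1} vectors stored as packed bits."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask   # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches


def binary_conv1d(activations, weights):
    """Valid 1-D sliding-window dot product over +1/-1 sequences."""
    k = len(weights)
    w_bits = pack_bits(weights)
    return [binary_dot(pack_bits(activations[i:i + k]), w_bits, k)
            for i in range(len(activations) - k + 1)]
```

Each output needs only one XNOR and one population count over the packed window, which is the kind of operation that maps well onto FPGA logic; the multi-channel, 2-D case follows the same pattern with wider bit vectors.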

Citations: 38

An OpenCL™ Deep Learning Accelerator on Arria 10

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-01-13 DOI: 10.1145/3020078.3021738

U. Aydonat, Shane O'Connell, D. Capalija, A. Ling, Gordon R. Chiu

Abstract: Convolutional neural nets (CNNs) have become a practical means to perform vision tasks, particularly in the area of image classification. FPGAs are well known to be able to perform convolutions efficiently; however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device. We show a novel architecture written in OpenCL™, which we refer to as a Deep Learning Accelerator (DLA), that maximizes data reuse and minimizes external memory bandwidth. Furthermore, we show how we can use the Winograd transform to significantly boost the performance of the FPGA. As a result, when running our DLA on Intel's Arria 10 device we can achieve a performance of 1020 img/s, or 23 img/s/W when running the AlexNet CNN benchmark. This comes to 1382 GFLOPS and is 10x faster with 8.4x more GFLOPS and 5.8x better efficiency than the state-of-the-art on FPGAs. Additionally, 23 img/s/W is competitive against the best publicly known implementation of AlexNet on nVidia's TitanX GPU.
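
To show why the Winograd transform named in this abstract reduces arithmetic, here is a minimal sketch of the standard 1-D case F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. The tile size and the function names are assumptions chosen for illustration; the listing does not say which Winograd configuration the DLA uses.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d, using 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3), using 4 multiplications."""
    # Filter transform: depends only on g, so it can be precomputed once.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by the 4 element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Because the filter transform is amortized over every tile that reuses the same weights, the per-tile cost is dominated by the four multiplications, which is where the saving in multiplier resources comes from.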

Citations: 234

ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-12-01 DOI: 10.1145/3020078.3021745

Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, W. Dally

Abstract: Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large models are both computation-intensive and memory-intensive. Deploying such bulky models results in high power consumption and leads to a high total cost of ownership (TCO) of a data center. To speed up the prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of the prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model to multiple PEs for parallelism and schedules the complicated LSTM data flow. Finally, we design the hardware architecture, named Efficient Speech Recognition Engine (ESE), that works directly on the sparse LSTM model. Implemented on a Xilinx KU060 FPGA running at 200 MHz, ESE has a performance of 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations. It achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU respectively.
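
The idea behind load-balance-aware pruning can be sketched as follows, under stated assumptions that are not taken from this listing: rows of the weight matrix are assumed to be interleaved across processing elements (row `r` goes to PE `r % num_pes`), and each PE's share is pruned by magnitude to the same number of surviving weights so that no PE finishes late. The function names and parameters are hypothetical.

```python
def load_balanced_prune(matrix, num_pes, keep_per_pe):
    """Keep only the keep_per_pe largest-magnitude weights in each PE's rows."""
    pruned = [[0.0] * len(row) for row in matrix]
    for pe in range(num_pes):
        # Gather (magnitude, row, col) for every weight assigned to this PE.
        owned = [(abs(w), r, c)
                 for r, row in enumerate(matrix) if r % num_pes == pe
                 for c, w in enumerate(row)]
        owned.sort(reverse=True)
        for _, r, c in owned[:keep_per_pe]:
            pruned[r][c] = matrix[r][c]
    return pruned


def nonzeros_per_pe(matrix, num_pes):
    """Count surviving weights per PE; equal counts mean balanced work."""
    counts = [0] * num_pes
    for r, row in enumerate(matrix):
        counts[r % num_pes] += sum(1 for w in row if w != 0)
    return counts
```

Pruning by a single global magnitude threshold would usually leave different PEs with different amounts of work; fixing the per-PE budget trades a little freedom in which weights survive for equal sparse workloads.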

Citations: 570