Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs

2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Pub Date: 2019-04-01. DOI: 10.1109/FCCM.2019.00035

E. Nurvitadhi, Dongup Kwon, A. Jafari, Andrew Boutros, Jaewoong Sim, Phil Tomson, H. Sumbul, Gregory K. Chen, Phil V. Knag, Raghavan Kumar, R. Krishnamurthy, Sergey Gribok, B. Pasca, M. Langhammer, Debbie Marr, A. Dasu


Citations: 38

Abstract

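Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on data-intensive deep learning (DL) algorithms with strict latency constraints and thus require balancing data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for real-time services, because it avoids expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and a modern high-end FPGA (Intel® Stratix® 10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA, with more balanced on-chip memory and compute, achieves much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as a system-in-package to enhance on-chip memory capacity and bandwidth, and to provide compute throughput matching the required bandwidth. We show that a small 32 mm² TensorRAM chiplet in a 10 nm process can offer 64 MB of memory, 32 TB/s of on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× better latency than the GPU (FP32) and 34× higher energy efficiency. It has 2× the aggregate on-chip memory capacity of a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and when integrated with an ASIC chiplet, it can offer a more compelling solution.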
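The TensorRAM figures quoted in the abstract (64 MB, 32 TB/s, 64 TOPS at INT8) can be related to each other with a rough back-of-the-envelope calculation. The Python sketch below is not taken from the paper: the helper names, the 2048-wide LSTM layer, the reading of "64 MB" as 64 × 2^20 bytes, and the assumption of batch-1 matrix-vector work (each INT8 weight read once per time step and used in one multiply-accumulate, counted as 2 operations) are all illustrative assumptions.

```python
# Illustrative sketch only: helper names and the layer size are hypothetical.
# The 64 MB / 32 TB/s / 64 TOPS figures are the ones quoted in the abstract.

GATES = {"rnn": 1, "gru": 3, "lstm": 4}  # weight-matrix groups per cell type


def weight_bytes(cell: str, hidden: int, inputs: int, bytes_per_weight: int = 1) -> int:
    """Weight footprint of one recurrent layer, ignoring biases."""
    return GATES[cell] * hidden * (hidden + inputs) * bytes_per_weight


def fits_on_chip(n_bytes: int, capacity_mb: float = 64.0) -> bool:
    """True if the weights can stay resident (persistent) in on-chip memory.

    Assumption: 1 MB is treated as 2**20 bytes.
    """
    return n_bytes <= capacity_mb * 2**20


def matched_tops(bandwidth_tb_s: float, ops_per_weight: int = 2,
                 bytes_per_weight: int = 1) -> float:
    """Compute rate (TOPS) needed to consume weights as fast as memory supplies them.

    Assumption: batch-1 matrix-vector work, so each weight is read once per
    time step and used in one multiply-accumulate (counted as 2 operations).
    """
    return bandwidth_tb_s * ops_per_weight / bytes_per_weight


def step_time_us(n_bytes: int, bandwidth_tb_s: float) -> float:
    """Bandwidth-bound lower limit (microseconds) for one time step."""
    return n_bytes / (bandwidth_tb_s * 1e12) * 1e6


lstm = weight_bytes("lstm", hidden=2048, inputs=2048)  # hypothetical layer size
print("INT8 weights (MiB):", lstm / 2**20)
print("fits in 64 MB:", fits_on_chip(lstm))
print("compute matched to 32 TB/s (TOPS):", matched_tops(32.0))
print("bandwidth-bound step time (us):", step_time_us(lstm, 32.0))
```

Under these assumptions, 32 TB/s of INT8 weight bandwidth corresponds to 64 TOPS, which is consistent with the abstract's statement that the chiplet's compute throughput matches its bandwidth. The step-time figure is only a lower bound from weight streaming and says nothing about the latencies measured in the paper.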



