Implementing FPGA Overlay NoCs Using the Xilinx UltraScale Memory Cascades

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Pub Date: 2017-04-01. DOI: 10.1109/FCCM.2017.15

Nachiket Kapre


Citations: 8

Abstract



We can enhance the performance and efficiency of deflection-routed FPGA overlay NoCs by exploiting the cascading feature of the Xilinx UltraScale BlockRAMs. This allows us to (1) harden the multiplexers in the NoC switch crossbars, and (2) efficiently add buffering support to deflection-routing. While buffering is not required for correct operation of a deflection-routed NoC, it can boost network throughputs for large system sizes under heavy load and allow functional support for fixed-length, multi-flit NoC traffic. Since the multiplexer controls of the cascaded RAMs can be driven from user-logic, the NoC routing function can be implemented in LUTs while the data is steered across the dedicated cascade multiplexers and links. Thus, our approach uses hard resources in the BlockRAM architecture to absorb the bulk of the cost of a NoC in the form of crossbar multiplexing, as well as packet queuing. For the XCVU9P UltraScale+ FPGA, we show how to map the 72b Hoplite NoC router at a cost of 3 FIFO blocks, 64 LUTs, and 40 FFs per switch while operating at ≈727 MHz (400 MHz in a 60×12 grid). This reduces LUT count by 1.4× and FF cost by 2× over a pure LUT-based implementation while also being 1.2× faster. For uniform RANDOM traffic, we boost throughputs of a 16×16 NoC by 50–60%, reduce worst-case packet latency by ≈40%, and lower energy use by 10–40% over classic bufferless deflection-routing at injection rates of 15–20% and higher with 16-deep buffers. When compared to hard NoC router designs, our BRAM-based soft NoC also closes the area gap to under a factor of two instead of the 20–23× gap claimed in earlier studies.
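The per-switch costs and ratios quoted in the abstract can be turned into rough whole-network figures. The sketch below is a back-of-the-envelope calculation only: it assumes one switch per grid node, and the function names `noc_totals` and `implied_lut_baseline` are hypothetical helpers introduced here for illustration. The "implied baseline" values are backed out of the rounded ratios (1.4×, 2×, 1.2×) and are estimates, not numbers reported by the paper.

```python
# Per-switch cost of the 72b Hoplite router on the XCVU9P, as quoted in the abstract.
PER_SWITCH = {"fifo_blocks": 3, "luts": 64, "ffs": 40}


def noc_totals(cols, rows, per_switch=PER_SWITCH):
    """Scale the per-switch cost to a cols x rows grid.

    Assumption of this sketch: exactly one switch per grid node.
    """
    switches = cols * rows
    return {name: count * switches for name, count in per_switch.items()}


def implied_lut_baseline(luts=64, ffs=40, fmax_mhz=727.0,
                         lut_ratio=1.4, ff_ratio=2.0, speedup=1.2):
    """Back out the pure LUT-based router figures implied by the quoted ratios.

    The ratios are rounded in the abstract, so the results are estimates only.
    """
    return {
        "luts": luts * lut_ratio,        # BRAM-based design uses 1.4x fewer LUTs
        "ffs": ffs * ff_ratio,           # ... and 2x fewer FFs
        "fmax_mhz": fmax_mhz / speedup,  # ... while running 1.2x faster
    }


if __name__ == "__main__":
    print("16x16 NoC totals:", noc_totals(16, 16))
    print("60x12 NoC totals:", noc_totals(60, 12))
    print("Implied pure-LUT router:", implied_lut_baseline())
```

Under these assumptions, the 16×16 network evaluated for throughput would need 768 FIFO blocks, 16384 LUTs, and 10240 FFs for its switches, and the 60×12 grid would need 2160 FIFO blocks.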

