HopliteML: Evolving Application Customized FPGA NoCs with Adaptable Routers and Regulators

ACM Transactions on Reconfigurable Technology and Systems (TRETS) | Pub Date: 2022-02-14 | DOI: 10.1145/3507699

G. Malik, Ian Elmor Lang, R. Pellizzoni, Nachiket Kapre



Abstract


HopliteML: Evolving Application Customized FPGA NoCs with Adaptable Routers and Regulators

We can overcome the pessimism in worst-case routing latency analysis of timing-predictable Network-on-Chip (NoC) workloads by single-digit factors through the use of a hybrid field-programmable gate array (FPGA)–optimized NoC and workload-adapted regulation. Timing-predictable FPGA-optimized NoCs such as HopliteBuf integrate stall-free FIFOs that are sized using offline static analysis of a user-supplied flow pattern and rates. For certain bursty traffic and flow configurations, static analysis delivers very large, sometimes infeasible, FIFO size bounds and large worst-case latency bounds. Alternatively, backpressure-based NoCs such as HopliteBP can operate with lower latencies for certain bursty flows. However, they suffer from severe pessimism in the analysis due to the effect of pipelining of packets and interleaving of flows at switch ports. As we show in this article, a hybrid FPGA NoC that seamlessly composes both design styles on a per-switch basis delivers the best of both worlds, with improved feasibility (bounded operation) and tighter latency bounds. We select the NoC switch configuration through a novel evolutionary algorithm based on Maximum Likelihood Estimation (MLE). For synthetic (RANDOM, LOCAL) and real-world (SpMV, Graph) workloads, we demonstrate ≈2–3× improvements in feasibility and ≈1–6.8× in worst-case latency while requiring a LUT cost only ≈1–1.5× that of the cheapest HopliteBuf solution. We also deploy and verify our NoC (in the programmable logic, PL) and MLE framework (on the processing system, PS) on a Pynq-Z1 to adapt and reconfigure NoC switches dynamically.

We can further improve a workload’s routability by learning to surgically tune regulation rates for each traffic trace to maximize available routing bandwidth. We capture critical dependencies between traces by modelling the regulation space as a multivariate Gaussian distribution and learn the distribution’s parameters using Covariance Matrix Adaptation Evolution Strategy (CMA-ES). We also propose nested learning, which learns switch configurations and regulation rates in tandem. Compared with stand-alone switch learning, this symbiotic nested learning helps achieve ≈1.5× lower cost-constrained latency, ≈3.1× faster individual rates, and ≈1.4× faster mean rates. We also evaluate improvements to vanilla NoCs’ routing using only stand-alone rate learning (no switch learning), with ≈1.6× lower latency across synthetic and real-world benchmarks.
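The abstract does not spell out the MLE-based evolutionary algorithm, so the following is only an illustrative sketch of one common way to pair evolution with maximum likelihood: an estimation-of-distribution loop over per-switch choices. It assumes each switch is either buffered (HopliteBuf-like, encoded as 0) or backpressure-based (HopliteBP-like, encoded as 1), and it models the choices as independent Bernoulli variables whose MLE is the mean of the best samples. The function `evolve_switch_config`, its parameters, `HIDDEN_TARGET`, and `toy_cost` are all hypothetical; in particular, `toy_cost` merely stands in for the static latency and LUT-cost analysis that the article relies on.

```python
import random


def evolve_switch_config(cost, n_switches, pop=40, elite=8, gens=30, seed=0):
    """Search per-switch styles with an estimation-of-distribution loop.

    p[i] is the probability that switch i is sampled as 1 (backpressure).
    All names and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    p = [0.5] * n_switches                      # start with no preference
    best, best_cost = None, float("inf")
    for _ in range(gens):
        # Sample a population of candidate NoC configurations.
        samples = [[1 if rng.random() < pi else 0 for pi in p]
                   for _ in range(pop)]
        # Rank candidates by cost (lower is better).
        scored = sorted(((cost(s), s) for s in samples), key=lambda t: t[0])
        if scored[0][0] < best_cost:
            best_cost, best = scored[0][0], list(scored[0][1])
        top = [s for _, s in scored[:elite]]
        # MLE of an independent Bernoulli parameter is the elite sample mean.
        p = [sum(s[i] for s in top) / elite for i in range(n_switches)]
        # Clamp so that every switch style keeps a nonzero sampling chance.
        p = [min(0.95, max(0.05, pi)) for pi in p]
    return best, best_cost


# Hypothetical stand-in for the real analysis: distance to a hidden "ideal" mix.
HIDDEN_TARGET = [0, 1, 1, 0, 0, 1, 0, 1]


def toy_cost(config):
    """Count switches whose style differs from the hidden target."""
    return sum(a != b for a, b in zip(config, HIDDEN_TARGET))


best, best_cost = evolve_switch_config(toy_cost, n_switches=len(HIDDEN_TARGET))
print(best, best_cost)
```

In a real flow the cost function would combine the worst-case latency bound, feasibility, and LUT cost of each candidate configuration. The independence assumption between switches is a simplification made here for brevity, not something the abstract states.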
