A just-in-time modulo scheduling for virtual coarse-grained reconfigurable architectures

2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS) Pub Date : 2013-07-15 DOI:10.1109/SAMOS.2013.6621122

R. Ferreira, Vinicius Duarte, Waldir Meireles, M. Pereira, L. Carro, Stephan Wong


Citations: 12

Abstract

In the past decade, most solutions for mapping compute-intensive loop kernels to accelerators have relied on heuristics and compiler-based strategies. As a result, most decisions must be taken at design time, which precludes efficient solutions that take run-time information into account. Success in accelerating such applications depends largely on two steps: extracting the loops and mapping them onto the architecture. The mapping step is a challenge in itself, since it is an NP-complete problem. In this paper, we propose a run-time solution that speeds up the mapping step by 3 to 6 orders of magnitude compared to the state of the art, at minimal performance degradation, by combining three mechanisms: 1) a simple and efficient modulo scheduling heuristic, 2) a crossbar network, which simplifies placement and routing, and 3) a virtual coarse-grained reconfigurable architecture (CGRA). Additionally, since the CGRA is a virtual layer on top of an FPGA, any off-the-shelf FPGA can be used without special tools or IP solutions. Although the mapping remains NP-complete even for crossbar-based CGRAs, experimental results demonstrate a large reduction in compilation time: whereas previous solutions require seconds to map the applications, our solution requires only microseconds to find near-optimal schedules. Besides the speedup, the proposed solution enables just-in-time compilation and is therefore intrinsically adaptive to changing scenarios.
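To make the idea of modulo scheduling on a crossbar-connected array concrete, the following is a minimal illustrative sketch, not the heuristic proposed in the paper. It assumes an acyclic dataflow graph, unit-latency operations, and a full crossbar (any processing element can read any other's output, so placement reduces to finding a free slot in the modulo reservation table). The names `modulo_schedule`, `topo_order`, `num_pes`, and `max_ii` are hypothetical.

```python
from math import ceil


def topo_order(ops, deps):
    """Return ops in topological order; deps maps op -> list of predecessors.

    Assumes the dependence graph is acyclic (no loop-carried recurrences).
    """
    order, seen = [], set()

    def visit(op):
        if op in seen:
            return
        seen.add(op)
        for pred in deps.get(op, []):
            visit(pred)
        order.append(op)

    for op in ops:
        visit(op)
    return order


def modulo_schedule(ops, deps, num_pes, latency=1, max_ii=64):
    """Greedy modulo scheduling sketch for a crossbar-connected PE array.

    Returns (ii, schedule), where schedule maps op -> (start_time, pe).
    """
    # Resource-constrained lower bound on the initiation interval (II).
    res_mii = ceil(len(ops) / num_pes)
    for ii in range(res_mii, max_ii + 1):
        # Modulo reservation table: ii rows (time mod ii) by num_pes columns.
        mrt = [[None] * num_pes for _ in range(ii)]
        schedule = {}
        feasible = True
        for op in topo_order(ops, deps):
            # Earliest start: all predecessors must have produced their result.
            earliest = max(
                (schedule[p][0] + latency for p in deps.get(op, [])), default=0
            )
            # Scanning ii consecutive cycles visits every row of the table once.
            for t in range(earliest, earliest + ii):
                row = mrt[t % ii]
                if None in row:
                    pe = row.index(None)
                    row[pe] = op
                    schedule[op] = (t, pe)
                    break
            else:
                feasible = False  # no free slot at this II; try a larger one
                break
        if feasible:
            return ii, schedule
    raise ValueError("no schedule found up to max_ii")


# Example: six operations on two processing elements.
ops = ["a", "b", "c", "d", "e", "f"]
deps = {"c": ["a", "b"], "d": ["c"], "e": ["c"], "f": ["d", "e"]}
ii, schedule = modulo_schedule(ops, deps, num_pes=2)
print(ii, schedule)
```

Because the crossbar removes routing constraints in this simplified model, each operation is placed with a single linear scan of the reservation table, which hints at why a crossbar can make the mapping step very fast. Loop-carried dependences, multi-cycle operations, and register pressure are deliberately left out.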

