Interconnect synthesis of heterogeneous accelerators in a shared memory architecture

2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). Pub Date: 2015-07-22. DOI: 10.1109/ISLPED.2015.7273540

Yu-Ting Chen, J. Cong


Citations: 5

Abstract

An accelerator-rich architecture (ARA) is composed of heterogeneous accelerators with an on-chip memory system. Compared to a general-purpose processor, an accelerator demands short and predictable latency to its local on-chip memory to meet its performance target. Moreover, an accelerator requires much higher off-chip memory bandwidth than a CPU, since it consumes much more data in a given time period. A customized on-chip memory system design is therefore one of the keys to an efficient ARA. In this work we provide a two-layer interconnect synthesis method. The first layer is an optimal partial crossbar that connects the heterogeneous accelerators and shared memory banks with a minimum number of switches. The second layer interleaves potentially conflicting long-burst memory requests that prefetch data from off-chip memory. The experimental results show that we can reduce the number of switches in the partial crossbar by more than 45% compared to the best known method. This in turn leads to a 53% reduction in LUTs and a 34% reduction in slice utilization on a 30-accelerator FPGA prototype. Furthermore, a well-designed interleaved network improves the performance of an ARA by 36% to 52% in a real ARA prototype for medical imaging applications. This prototype also shows a 7.44x energy efficiency gain over state-of-the-art Xeon processors.
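To make the idea of a partial crossbar concrete, here is a minimal sketch. It is not the synthesis method of the paper, which the abstract does not detail. It assumes a partial crossbar can be modeled as a map from accelerator ports to the memory banks they can reach, with one switch per (port, bank) pair, and that a set of simultaneously active ports is routable when each port can be given a distinct bank. The names `count_switches`, `can_route`, `full_crossbar`, and the 4-port, 3-bank example are all hypothetical.

```python
from itertools import combinations


def count_switches(conn):
    """Number of switches, assuming one switch per (port, bank) connection."""
    return sum(len(banks) for banks in conn.values())


def full_crossbar(num_ports, num_banks):
    """Every port can reach every bank."""
    return {p: set(range(num_banks)) for p in range(num_ports)}


def can_route(conn, active_ports):
    """True if every active port can be assigned its own distinct bank.

    Uses augmenting-path bipartite matching (Kuhn's algorithm).
    """
    bank_owner = {}  # bank -> port currently assigned to it

    def try_assign(port, seen):
        for bank in conn[port]:
            if bank in seen:
                continue
            seen.add(bank)
            # Take a free bank, or evict the current owner if it can move.
            if bank not in bank_owner or try_assign(bank_owner[bank], seen):
                bank_owner[bank] = port
                return True
        return False

    return all(try_assign(p, set()) for p in active_ports)


# Hypothetical example: 4 ports, 3 banks, at most 3 ports active at once.
partial = {0: {0}, 1: {0, 1}, 2: {1, 2}, 3: {2}}
full = full_crossbar(4, 3)

if __name__ == "__main__":
    print("full crossbar switches:   ", count_switches(full))
    print("partial crossbar switches:", count_switches(partial))
    ok = all(can_route(partial, s) for s in combinations(range(4), 3))
    print("every 3-port subset routable:", ok)
```

In this toy instance the partial crossbar keeps every 3-port subset routable while using half the switches of the full crossbar. Real designs must also account for the heterogeneous port counts of the accelerators, which this sketch ignores.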
