Interconnect synthesis of heterogeneous accelerators in a shared memory architecture

Yu-Ting Chen, J. Cong
2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
Published: 2015-07-22
DOI: 10.1109/ISLPED.2015.7273540 (https://doi.org/10.1109/ISLPED.2015.7273540)
Citations: 5

Abstract

An accelerator-rich architecture (ARA) is composed of heterogeneous accelerators and an on-chip memory system. Compared with a general-purpose processor, an accelerator demands short and predictable latency to its local on-chip memory to meet its performance target. An accelerator also requires much higher off-chip memory bandwidth than a CPU because it consumes far more data in a given time period. A customized on-chip memory system is therefore one of the keys to an efficient ARA. In this work we present a two-layer interconnect synthesis method. The first layer is an optimal partial crossbar that connects the heterogeneous accelerators to shared memory banks with a minimum number of switches. The second layer aims to interleave potentially conflicting long-burst memory requests that prefetch data from off-chip memory. Experimental results show that the partial crossbar uses more than 45% fewer switches than the best known method, which in turn leads to a 53% reduction in LUTs and a 34% reduction in slice utilization on a 30-accelerator FPGA prototype. Furthermore, a well-designed interleaved network improves ARA performance by 36% to 52% on a real ARA prototype for medical imaging applications. This prototype also shows a 7.44x energy efficiency gain over state-of-the-art Xeon processors.
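To make the idea of a partial crossbar concrete, the following is a minimal sketch and not the synthesis method of the paper, which the abstract does not spell out. It assumes a toy model in which each accelerator port is wired to a subset of memory banks, one switch per wire, and at most `max_active` ports need a private bank at the same time. The names `count_switches`, `can_route`, and `always_routable`, and the three example designs, are hypothetical and chosen only for illustration. Routability is checked with a standard bipartite matching (augmenting paths), which may differ from the paper's formulation.

```python
from itertools import combinations

def count_switches(conn):
    """Number of switches, assuming one switch per port-to-bank connection."""
    return sum(len(banks) for banks in conn.values())

def can_route(conn, active):
    """Return True if every active port can be given its own distinct bank.

    Uses augmenting-path bipartite matching (Kuhn's algorithm).
    """
    owner = {}  # bank -> port currently assigned to it

    def try_assign(port, seen):
        for bank in conn[port]:
            if bank in seen:
                continue
            seen.add(bank)
            # Take a free bank, or move its current owner to another bank.
            if bank not in owner or try_assign(owner[bank], seen):
                owner[bank] = port
                return True
        return False

    return all(try_assign(port, set()) for port in active)

def always_routable(conn, max_active):
    """Check every combination of max_active simultaneously active ports."""
    return all(can_route(conn, subset)
               for subset in combinations(conn, max_active))

# Hypothetical toy designs: 4 accelerator ports, 2 shared banks, 2 ports active at once.
full_crossbar = {port: {0, 1} for port in range(4)}    # every port reaches every bank
partial_crossbar = {0: {0}, 1: {1}, 2: {0, 1}, 3: {0, 1}}
too_sparse = {0: {0}, 1: {0}, 2: {1}, 3: {1}}          # ports 0 and 1 collide on bank 0

for name, design in [("full", full_crossbar),
                     ("partial", partial_crossbar),
                     ("too sparse", too_sparse)]:
    print(name, count_switches(design), always_routable(design, 2))
```

In this toy case the partial design serves every pair of active ports with 6 switches instead of the 8 a full crossbar needs, while the sparser 4-switch design fails for some pairs. The trade-off between switch count and guaranteed routability is what the first interconnect layer is described as optimizing.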