A Monolithic 3D Hybrid Architecture for Energy-Efficient Computation

Ye Yu;Niraj K. Jha
{"title":"一种用于节能计算的单片三维混合结构","authors":"Ye Yu;Niraj K. Jha","doi":"10.1109/TMSCS.2018.2882433","DOIUrl":null,"url":null,"abstract":"The exponentially increasing performance of chip multiprocessors (CMPs) predicted by Moore's Law is no longer due to the increasing clock rate of a single CPU core, but on account of the increase of core counts in the CMP. More transistors are integrated within the same footprint area as the technology node shrinks to deliver higher performance. However, this is accompanied by higher power dissipation that usually exceeds the coping capability of inexpensive cooling techniques. This Power Wall prevents the chip from running at full speed with all the devices powered-on. This is known as the dark silicon problem. Another major bottleneck in CMP development is the imbalance between the CPU clock rate and memory access speed. This Memory Wall keeps the CPU from fully utilizing its compute power. To address both the Power and Memory Walls, we propose a monolithic 3D hybrid architecture that consists of a multi-core CPU tier, a fine-grain dynamically reconfigurable (FDR) field-programmable gate array (FPGA) tier, and multiple resistive RAM (RRAM) tiers. The FDR tier is used as an accelerator. It uses the concept of temporal logic folding to localize on-chip communication. The RRAM tiers are connected to the CPU and FDR tiers through an efficient memory interface that takes advantage of the tremendous bandwidth available from monolithic inter-tier vias and hides the latency of large data transfers. We evaluate the architecture on two types of benchmarks: compute-intensive and memory-intensive. We show that the architecture reduces both power and energy significantly at a better performance for both types of applications. Compared to the baseline, our architecture achieves an average of 43.1× and 2.5× speedup on compute-intensive and memory-intensive benchmarks, respectively. The power and energy consumption are reduced by 5.0× and 40.5×, respectively, for compute-intensive applications, and 2.0× and 4.2×, respectively, for memory-intensive applications. This translates to 1745.3× energy-delay product (EDP) improvement for compute-intensive applications and 10.5× for memory-intensive applications.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"533-547"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2882433","citationCount":"7","resultStr":"{\"title\":\"A Monolithic 3D Hybrid Architecture for Energy-Efficient Computation\",\"authors\":\"Ye Yu;Niraj K. Jha\",\"doi\":\"10.1109/TMSCS.2018.2882433\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The exponentially increasing performance of chip multiprocessors (CMPs) predicted by Moore's Law is no longer due to the increasing clock rate of a single CPU core, but on account of the increase of core counts in the CMP. More transistors are integrated within the same footprint area as the technology node shrinks to deliver higher performance. However, this is accompanied by higher power dissipation that usually exceeds the coping capability of inexpensive cooling techniques. This Power Wall prevents the chip from running at full speed with all the devices powered-on. This is known as the dark silicon problem. 
Another major bottleneck in CMP development is the imbalance between the CPU clock rate and memory access speed. This Memory Wall keeps the CPU from fully utilizing its compute power. To address both the Power and Memory Walls, we propose a monolithic 3D hybrid architecture that consists of a multi-core CPU tier, a fine-grain dynamically reconfigurable (FDR) field-programmable gate array (FPGA) tier, and multiple resistive RAM (RRAM) tiers. The FDR tier is used as an accelerator. It uses the concept of temporal logic folding to localize on-chip communication. The RRAM tiers are connected to the CPU and FDR tiers through an efficient memory interface that takes advantage of the tremendous bandwidth available from monolithic inter-tier vias and hides the latency of large data transfers. We evaluate the architecture on two types of benchmarks: compute-intensive and memory-intensive. We show that the architecture reduces both power and energy significantly at a better performance for both types of applications. Compared to the baseline, our architecture achieves an average of 43.1× and 2.5× speedup on compute-intensive and memory-intensive benchmarks, respectively. The power and energy consumption are reduced by 5.0× and 40.5×, respectively, for compute-intensive applications, and 2.0× and 4.2×, respectively, for memory-intensive applications. This translates to 1745.3× energy-delay product (EDP) improvement for compute-intensive applications and 10.5× for memory-intensive applications.\",\"PeriodicalId\":100643,\"journal\":{\"name\":\"IEEE Transactions on Multi-Scale Computing Systems\",\"volume\":\"4 4\",\"pages\":\"533-547\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2882433\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multi-Scale Computing Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/8540885/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multi-Scale Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/8540885/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 7

Abstract

The exponentially increasing performance of chip multiprocessors (CMPs) predicted by Moore's Law is no longer driven by the increasing clock rate of a single CPU core, but by the increasing number of cores in the CMP. As the technology node shrinks, more transistors are integrated within the same footprint area to deliver higher performance. However, this is accompanied by higher power dissipation that usually exceeds what inexpensive cooling techniques can handle. This Power Wall prevents the chip from running at full speed with all devices powered on, a limitation known as the dark silicon problem. Another major bottleneck in CMP development is the imbalance between the CPU clock rate and memory access speed. This Memory Wall keeps the CPU from fully utilizing its compute power. To address both the Power and Memory Walls, we propose a monolithic 3D hybrid architecture that consists of a multi-core CPU tier, a fine-grain dynamically reconfigurable (FDR) field-programmable gate array (FPGA) tier, and multiple resistive RAM (RRAM) tiers. The FDR tier is used as an accelerator. It uses the concept of temporal logic folding to localize on-chip communication. The RRAM tiers are connected to the CPU and FDR tiers through an efficient memory interface that takes advantage of the tremendous bandwidth available from monolithic inter-tier vias and hides the latency of large data transfers. We evaluate the architecture on two types of benchmarks: compute-intensive and memory-intensive. We show that the architecture significantly reduces both power and energy consumption while also improving performance for both types of applications. Compared to the baseline, our architecture achieves an average speedup of 43.1× on compute-intensive benchmarks and 2.5× on memory-intensive benchmarks. Power and energy consumption are reduced by 5.0× and 40.5×, respectively, for compute-intensive applications, and by 2.0× and 4.2×, respectively, for memory-intensive applications. This translates to a 1745.3× energy-delay product (EDP) improvement for compute-intensive applications and 10.5× for memory-intensive applications.
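As a quick sanity check (not taken from the paper itself), the reported EDP figures follow from multiplying the speedup by the energy reduction, since EDP is the product of energy and delay. A minimal sketch in Python using the averages quoted in the abstract:

# Sanity check on the reported numbers: EDP = energy x delay, so the EDP
# improvement factor is the energy-reduction factor times the speedup factor.
# The inputs below are the average figures quoted in the abstract.
workloads = {
    "compute-intensive": {"speedup": 43.1, "energy_reduction": 40.5},
    "memory-intensive": {"speedup": 2.5, "energy_reduction": 4.2},
}

for name, w in workloads.items():
    edp_improvement = w["speedup"] * w["energy_reduction"]
    print(f"{name}: {edp_improvement:.1f}x EDP improvement")

# Prints roughly 1745.6x and 10.5x, consistent with the abstract's 1745.3x and
# 10.5x once rounding of the reported input factors is taken into account.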