在嵌入式处理环境中设计负载/存储缓冲区的权衡

Y. Kang, J. Draper
{"title":"在嵌入式处理环境中设计负载/存储缓冲区的权衡","authors":"Y. Kang, J. Draper","doi":"10.1109/MWSCAS.2007.4488819","DOIUrl":null,"url":null,"abstract":"Memory latency is a critical issue for conventional high-speed computing platforms, and it is becoming a common problem in embedded and CMP (chip multiprocessing) systems as well. Conventional processors typically adopt caches and a load/store queue (LSQ) to address the processor-to-memory bottleneck. However, the conventional LSQ design, which has a large number of entries, is not appropriate for embedded systems due to its area and power hungry out-of- order speculation. A compact, low-power load/store buffer that also provides significant performance improvement is essential for such systems. In this paper, we propose an area-efficient wideword load/store buffer (WLSB) which supports both WideWord (256-bit) and scalar (32-bit) load/store instructions for a recently fabricated PIM (processing-in-memory) device. Given its small size, the 4 entry WLSB yields a 57.33% load hit rate on SPEC2K benchmarks. This result is 5.72% better as compared to a less area-efficient 32-entry fully associative scalar load/store buffer (SLSB). The WLSB was synthesized in IBM 90 nm technology, and the resulting implementation occupies less than a seventh of a square mm and is projected to run at 1.6 ns cycle time with 15.72 mW of dynamic power dissipation. This paper demonstrates how this very small-entry buffer can affect the load hit rate and quantifies the design trade-offs between wide small-entry and narrow large-entry buffers with respect to size, power, load hit ratio and clock speed. Although this WLSB has been specifically designed to benefit a PIM architecture, it is expected to be useful for other embedded processing platforms and CMPs due to emphasized area/power constraints.","PeriodicalId":256061,"journal":{"name":"2007 50th Midwest Symposium on Circuits and Systems","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Design trade-offs for load/store buffers in embedded processing environments\",\"authors\":\"Y. Kang, J. Draper\",\"doi\":\"10.1109/MWSCAS.2007.4488819\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Memory latency is a critical issue for conventional high-speed computing platforms, and it is becoming a common problem in embedded and CMP (chip multiprocessing) systems as well. Conventional processors typically adopt caches and a load/store queue (LSQ) to address the processor-to-memory bottleneck. However, the conventional LSQ design, which has a large number of entries, is not appropriate for embedded systems due to its area and power hungry out-of- order speculation. A compact, low-power load/store buffer that also provides significant performance improvement is essential for such systems. In this paper, we propose an area-efficient wideword load/store buffer (WLSB) which supports both WideWord (256-bit) and scalar (32-bit) load/store instructions for a recently fabricated PIM (processing-in-memory) device. Given its small size, the 4 entry WLSB yields a 57.33% load hit rate on SPEC2K benchmarks. This result is 5.72% better as compared to a less area-efficient 32-entry fully associative scalar load/store buffer (SLSB). The WLSB was synthesized in IBM 90 nm technology, and the resulting implementation occupies less than a seventh of a square mm and is projected to run at 1.6 ns cycle time with 15.72 mW of dynamic power dissipation. This paper demonstrates how this very small-entry buffer can affect the load hit rate and quantifies the design trade-offs between wide small-entry and narrow large-entry buffers with respect to size, power, load hit ratio and clock speed. Although this WLSB has been specifically designed to benefit a PIM architecture, it is expected to be useful for other embedded processing platforms and CMPs due to emphasized area/power constraints.\",\"PeriodicalId\":256061,\"journal\":{\"name\":\"2007 50th Midwest Symposium on Circuits and Systems\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2007 50th Midwest Symposium on Circuits and Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/MWSCAS.2007.4488819\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2007 50th Midwest Symposium on Circuits and Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MWSCAS.2007.4488819","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

内存延迟是传统高速计算平台的一个关键问题,并且在嵌入式和芯片多处理(CMP)系统中也成为一个普遍问题。传统处理器通常采用缓存和加载/存储队列(LSQ)来解决处理器到内存的瓶颈。然而,传统的LSQ设计具有大量的条目,由于其面积和功耗的失序推测而不适合嵌入式系统。紧凑、低功耗的负载/存储缓冲器也提供了显著的性能改进,这对此类系统至关重要。在本文中,我们提出了一种面积高效的宽字加载/存储缓冲区(WLSB),它支持宽字(256位)和标量(32位)加载/存储指令,用于最近制造的PIM(内存中处理)设备。考虑到它的小尺寸,4条目WLSB在SPEC2K基准测试中产生57.33%的负载命中率。与面积效率较低的32项全关联标量加载/存储缓冲区(SLSB)相比,这个结果好5.72%。WLSB是在IBM 90nm技术中合成的,最终实现占地不到七分之一平方毫米,预计运行周期为1.6 ns,动态功耗为15.72 mW。本文演示了这个非常小的入口缓冲区如何影响负载命中率,并量化了宽的小入口缓冲区和窄的大入口缓冲区在大小、功率、负载命中率和时钟速度方面的设计权衡。虽然这个WLSB是专门为PIM架构设计的,但由于强调面积/功率限制,它预计对其他嵌入式处理平台和cmp也很有用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Design trade-offs for load/store buffers in embedded processing environments
Memory latency is a critical issue for conventional high-speed computing platforms, and it is becoming a common problem in embedded and CMP (chip multiprocessing) systems as well. Conventional processors typically adopt caches and a load/store queue (LSQ) to address the processor-to-memory bottleneck. However, the conventional LSQ design, which has a large number of entries, is not appropriate for embedded systems due to its area and power hungry out-of- order speculation. A compact, low-power load/store buffer that also provides significant performance improvement is essential for such systems. In this paper, we propose an area-efficient wideword load/store buffer (WLSB) which supports both WideWord (256-bit) and scalar (32-bit) load/store instructions for a recently fabricated PIM (processing-in-memory) device. Given its small size, the 4 entry WLSB yields a 57.33% load hit rate on SPEC2K benchmarks. This result is 5.72% better as compared to a less area-efficient 32-entry fully associative scalar load/store buffer (SLSB). The WLSB was synthesized in IBM 90 nm technology, and the resulting implementation occupies less than a seventh of a square mm and is projected to run at 1.6 ns cycle time with 15.72 mW of dynamic power dissipation. This paper demonstrates how this very small-entry buffer can affect the load hit rate and quantifies the design trade-offs between wide small-entry and narrow large-entry buffers with respect to size, power, load hit ratio and clock speed. Although this WLSB has been specifically designed to benefit a PIM architecture, it is expected to be useful for other embedded processing platforms and CMPs due to emphasized area/power constraints.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信