A 320mV-to-1.2V on-die fine-grained reconfigurable fabric for DSP/media accelerators in 32nm CMOS

A. Agarwal, S. Mathew, S. Hsu, M. Anders, Himanshu Kaul, F. Sheikh, R. Ramanarayanan, S. Srinivasan, R. Krishnamurthy, S. Borkar
2010 IEEE International Solid-State Circuits Conference (ISSCC). Published 2010-03-18. DOI: 10.1109/ISSCC.2010.5433903 (https://doi.org/10.1109/ISSCC.2010.5433903)
Citations: 40

Abstract

Computationally intensive DSP/media processing applications require specialized hardware accelerators to enable higher energy-efficiency on microprocessor platforms. On-die reconfigurable arrays enable flexible accelerators with dynamic on-the-fly programmability while amortizing die area and time-to-market costs across a wide range of workloads. An ultra-low-voltage fine-grained reconfigurable fabric consisting of a hybrid configurable logic block (CLB) array with process/voltage/temperature (PVT) variation-tolerant register file (Fig. 18.2.1), targeted for on-die acceleration of DSP/media algorithms on power-constrained mobile microprocessors, is fabricated in 32nm high-k/metal-gate CMOS [1]. The CLB combines self-decoded look-up tables (LUTs) for random logic with reconfigurable arithmetic building blocks, hybrid 3∶2 compressors with integrated partial product generation, configurable adder/multiplier carry propagation and optimized CLB input/output multiplexers to achieve peak energy-efficiency of 2.6TOPS/W measured at 340mV, 50°C. The register file includes programmable stacked shared keepers and interruptible operation of both write memory cells and set-dominant latches (SDLs), improving Vcc-min by 300mV across PVT variations with a wide dynamic operating range of 320mV–1.2V, enabling simultaneous dynamic supply/frequency optimization across target workloads and power budgets. These features also achieve: (i) nominal CLB performance of 2.4GHz, 5.3mW measured at 1.0V, (ii) robust CLB functionality measured at 260mV, 27MHz (sub-threshold) consuming 12µW, (iii) scalable register file performance up to 8.2GHz, 125mW measured at 1.2V, 50°C with low-voltage near-threshold operation at 320mV, 252MHz consuming 430µW, (iv) 4-tap FIR filter, radix-2 FFT butterfly and 16b string-match algorithms with peak throughput of 2.1GSamples/s, 2.4GSamples/s and 100Gbps respectively, and (v) application-dependent dual-supply power savings up to 34%.
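The operating points quoted in the abstract can be turned into a rough energy-per-clock-cycle comparison by dividing each reported power by its reported frequency. The sketch below is only illustrative arithmetic on those published figures: the function and variable names are hypothetical, and the abstract does not state how many operations complete per cycle, so these values are not directly comparable to the 2.6TOPS/W figure.

```python
# Operating points as quoted in the abstract:
# (label, supply voltage in V, clock frequency in Hz, power in W).
# The labels are descriptive names chosen for this sketch.
OPERATING_POINTS = [
    ("CLB, nominal", 1.0, 2.4e9, 5.3e-3),
    ("CLB, sub-threshold", 0.26, 27e6, 12e-6),
    ("Register file, maximum frequency", 1.2, 8.2e9, 125e-3),
    ("Register file, near-threshold", 0.32, 252e6, 430e-6),
]


def energy_per_cycle_pj(power_w: float, freq_hz: float) -> float:
    """Return average energy per clock cycle in picojoules (power / frequency)."""
    return power_w / freq_hz * 1e12


def summarize(points):
    """Map each operating-point label to its energy per cycle in pJ."""
    return {label: energy_per_cycle_pj(p, f) for label, _vcc, f, p in points}


if __name__ == "__main__":
    for label, vcc, freq, power in OPERATING_POINTS:
        pj = energy_per_cycle_pj(power, freq)
        print(f"{label:34s} {vcc:4.2f} V  {freq / 1e6:8.1f} MHz  {pj:6.2f} pJ/cycle")
```

Under this simple power/frequency model, the CLB drops from roughly 2.2 pJ per cycle at 1.0V to roughly 0.44 pJ at 260mV, and the register file from roughly 15.2 pJ at 1.2V to roughly 1.7 pJ at 320mV, which is consistent with the abstract's emphasis on low-voltage operation for energy efficiency.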