A 280mV-to-1.1V 256b reconfigurable SIMD vector permutation engine with 2-dimensional shuffle in 22nm CMOS

S. Hsu, A. Agarwal, M. Anders, S. Mathew, Himanshu Kaul, F. Sheikh, R. Krishnamurthy
2012 IEEE International Solid-State Circuits Conference. Published 2012-04-03. DOI: 10.1109/ISSCC.2012.6176966 (https://doi.org/10.1109/ISSCC.2012.6176966)
Citations: 46

Abstract

Energy-efficient SIMD permutation operations are key for maximizing high-performance microprocessor vector datapath utilization in multimedia, graphics, and signal processing workloads [1-3]. A wide SIMD vector permutation engine is required to achieve high-throughput data rearrangement operations on large data sets, with scaled supply voltages to deliver high energy efficiency. An ultra-low-voltage reconfigurable 4-way to 32-way SIMD vector permutation engine consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle is fabricated in 22nm CMOS. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clockless static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250mV across PVT variations with a wide dynamic operating range of 280mV-1.1V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully-connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates to average min-sized transistor variation, and ultra-low-voltage split-output (ULVS) level shifters improving logic VMIN by 150mV, while enabling peak energy efficiency of 585GOPS/W measured at 260mV, 50°C. The permutation engine occupies a dense layout of 0.048mm² (Fig. 10.1.7) while achieving: (i) nominal register file performance of 1.8GHz, 106mW measured at 0.9V, 50°C; (ii) robust register file functionality measured down to 280mV (subthreshold) with peak energy efficiency of 154GOPS/W; (iii) scalable permute crossbar performance of 2.9GHz, 69mW measured at 1.1V, 50°C with deep subthreshold operation at 240mV, 10MHz consuming 19μW; and (iv) a 64b 4×4 matrix transpose algorithm with 53% energy savings and 42% improved peak throughput of 263Gbps measured at 1.8GHz, 0.9V.
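
The 2-dimensional shuffle described in the abstract can be illustrated with a small functional model. The sketch below is only an illustration in Python, not the authors' circuit or their transpose algorithm: it assumes a 256b vector split into four 64b lanes, models the horizontal shuffle as a byte-wise any-to-any select and the vertical shuffle as a per-lane choice of register file entry, and transposes a 4×4 matrix with a generic skew / diagonal-read / unskew scheme. All function names (`permute_bytes`, `lane_rotate_sel`, `vertical_read`, `transpose_4x4`, `pack`, `unpack`) are hypothetical.

```python
LANE_BYTES = 8                    # 64b elements (assumed for this example)
LANES = 4                         # 4 x 64b = one 256b vector
VEC_BYTES = LANE_BYTES * LANES    # 32 bytes


def permute_bytes(vec, sel):
    """Horizontal shuffle model: output byte i is input byte sel[i].

    Any output byte may select any input byte (any-to-any)."""
    assert len(vec) == VEC_BYTES and len(sel) == VEC_BYTES
    return bytes(vec[s] for s in sel)


def lane_rotate_sel(shift):
    """Byte selects that move input lane l to output lane (l + shift) % LANES."""
    return [((i // LANE_BYTES - shift) % LANES) * LANE_BYTES + i % LANE_BYTES
            for i in range(VEC_BYTES)]


def vertical_read(rows, entry_sel):
    """Vertical shuffle model: lane m of the result is lane m of rows[entry_sel[m]]."""
    return b"".join(rows[e][m * LANE_BYTES:(m + 1) * LANE_BYTES]
                    for m, e in enumerate(entry_sel))


def transpose_4x4(rows):
    """Transpose a 4x4 matrix of 64b elements stored as four 256b rows."""
    # Step 1 (horizontal): rotate row c by c lanes so each column lands on a diagonal.
    skewed = [permute_bytes(rows[c], lane_rotate_sel(c)) for c in range(LANES)]
    out = []
    for r in range(LANES):
        # Step 2 (vertical): read one diagonal, one lane from each entry.
        diag = vertical_read(skewed, [(m - r) % LANES for m in range(LANES)])
        # Step 3 (horizontal): rotate back so the elements appear in column order.
        out.append(permute_bytes(diag, lane_rotate_sel(-r)))
    return out


def pack(row):
    """Pack four integers into one 256b vector (little-endian 64b lanes)."""
    return b"".join(x.to_bytes(LANE_BYTES, "little") for x in row)


def unpack(vec):
    """Unpack a 256b vector into four integers."""
    return [int.from_bytes(vec[i:i + LANE_BYTES], "little")
            for i in range(0, VEC_BYTES, LANE_BYTES)]


if __name__ == "__main__":
    matrix = [[4 * c + l for l in range(LANES)] for c in range(LANES)]
    for row in transpose_4x4([pack(r) for r in matrix]):
        print(unpack(row))
```

In this model each output row costs one vertical read and one horizontal permute after an initial skew pass. How the fabricated engine schedules these steps, and how it reaches the reported energy and throughput figures, is not stated in the abstract and is not modeled here.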