{"title":"A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing","authors":"Yang Wang, Yubin Qin, Dazheng Deng, Jingchuang Wei, Yang Zhou, Yuanqi Fan, Tianbao Chen, Hao-Fen Sun, Leibo Liu, Shaojun Wei, S. Yin","doi":"10.1109/ISSCC42614.2022.9731686","DOIUrl":"https://doi.org/10.1109/ISSCC42614.2022.9731686","url":null,"abstract":"Recently, Transformer-based models have achieved tremendous success in many AI fields, from NLP to CV, using the attention mechanism [1]–[3]. This mechanism captures the global correlations of input by indicating every two tokens' relevance with attention scores and uses normalized scores, defined as attention probabilities, to weight all input tokens to obtain output tokens with a global receptive field. A Transformer model consists of multiple blocks, named multi-head, working with the attention mechanism. Figure 29.2.1 details the computation of an attention block with query (Q), key (K), and value-matrix (V), computed by tokens and weight matrices. First, Q is multiplied by Kᵀ to generate the attention score matrix. The scores in each row, represented as $\\mathrm{X}_{\\mathrm{i}}$, indicate a token's relevance with all others. Second, the row-wise softmax with inputs of $\\mathrm{X}_{\\mathrm{i}}-\\mathrm{X}_{\\max}$ normalizes attention scores to probabilities (P), expanding the large scores and reducing the small scores exponentially. Finally, probabilities are quantized and then multiplied by V to produce the output. Each output token is a weighted sum of all input tokens, where the strongly related tokens have large weight values. 
Global attention-based models achieve 20.4% higher accuracy than LSTM for NLP and 15.1% higher accuracy than ResNet-152 for classification.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"51 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76117831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 16Gb 27Gb/s/pin T-coil based GDDR6 DRAM with Merged-MUX TX, Optimized WCK Operation, and Alternative-Data-Bus","authors":"Daewoong Lee, Hye-Jung Kwon, Daehyun Kwon, Jaehyeok Baek, C. Cho, Sanghoon Kim, Donggun An, C. Chang, Unhak Lim, Jiyeon Im, Wonju Sung, Hye-Ran Kim, Sun-Young Park, Hyoung-Ju Kim, Ho-Seok Seol, Juhwan Kim, Junabum Shin, Kil Y. Kang, Yong-Hun Kim, Sooyoung Kim, Wansoo Park, Seok-Jung Kim, ChanYong Lee, Seungseob Lee, T. Park, C. Oh, H. Ban, Hyungjong Ko, H. Song, T. Oh, Sang-Jun Hwang, Kyungseob Oh, J. Choi, Jooyoung Lee","doi":"10.1109/ISSCC42614.2022.9731614","DOIUrl":"https://doi.org/10.1109/ISSCC42614.2022.9731614","url":null,"abstract":"Graphic DRAMs have been developed to increase maximum I/O interface speeds to satisfy the demand of high-performance graphic applications [1]–[5]. Recently, PAM4 signaling was utilized to increase the I/O bandwidth up to 22Gb/s/pin [5]. However, the reduced voltage margin of PAM4, compared to NRZ, complicates circuit design; margins also become worse with a reduced power supply. This paper achieves 27Gb/s in NRZ, a 1.5× speed enhancement, by improving on previous GDDR6 [3]. A T-coil is designed, for the first time in a DRAM process, so that the maximum operation frequency is increased. The proposed merged-MUX TX increases the maximum speed and reduces power and area consumption. A quad-skew training technique enables a wider clock sampling margin for WCK: up to 3ps, which is 8.1% of 1UI at 27Gb/s/pin. Furthermore, a dual-mode frequency divider allows a wide-range operation from sub-1Gb/s/pin to 27Gb/s/pin. 
An alternative-data-bus (ADB) is proposed to solve the frequency limit of the data bus.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"1 1","pages":"446-448"},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76170684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 32nA Fully Autonomous Multi-Input Single-Inductor Multi-Output Energy-Harvesting and Power-Management Platform with 1.2×10⁵ Dynamic Range, Integrated MPPT, and Multi-Modal Cold Start-Up","authors":"Shuo Li, Xinjian Liu, B. Calhoun","doi":"10.1109/ISSCC42614.2022.9731732","DOIUrl":"https://doi.org/10.1109/ISSCC42614.2022.9731732","url":null,"abstract":"Energy harvesting and power management units (EHPMUs) are gaining popularity for self-powered Internet-of-Things (IoT) applications due to their ability to extract ambient energy and power load circuits through a single block. Among all EHPMU architectures, the multi-input single-inductor multi-output (MISIMO) [1]–[6] has the benefits of small form factor, high efficiency, extracting energy from multi-modal energy sources, and powering different types of loads. Self-powered IoT applications also require the EHPMUs to have ultra-low quiescent power, wide dynamic range, and autonomous features to support their deployment without any battery. However, previous EHPMUs either consume too much power [1]–[3] or only provide a small dynamic range [4], [5]. They also suffer from two-stage power delivery causing cascaded power loss [1], [5] and lack of essential components such as voltage references [6] for a fully deployable solution. To overcome all these challenges, in this work, we propose a fully autonomous MISIMO EHPMU platform that can extract energy from three energy harvesters with both AC and DC modalities and provide four custom voltage rails together with on-chip maximum power-point tracking (MPPT) and multi-modal cold start-up circuits. 
This EHPMU achieves 32nA quiescent current, 1.2×10⁵ dynamic range, 3.2× energy-extraction gain for piezoelectric energy harvesting, and 80% efficiency when delivering $1\\mu\\mathrm{A}$ output current.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"377 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76437299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memories of the Solid-State Circuits Council Transition to Solid-State Circuits Society","authors":"","doi":"10.1109/isscc42614.2022.9731593","DOIUrl":"https://doi.org/10.1109/isscc42614.2022.9731593","url":null,"abstract":"","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76446881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session 20 Overview: Body and Brain Interfaces","authors":"","doi":"10.1109/isscc42614.2022.9731615","DOIUrl":"https://doi.org/10.1109/isscc42614.2022.9731615","url":null,"abstract":"","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"215 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79597488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 6.5-to-10GHz IEEE 802.15.4/4z-Compliant 1T3R UWB Transceiver","authors":"Run Chen, Yuzhong Xiao, Yonggang Chen, Hua Xu, YU Peng, Qi Peng, Xian Li, Xiaofeng Guo, Jianlong Huang, Nansong Li, Xueqing Hu, Rongde Ou, Wenzhe Liu, Bei Chen, Wen Zhang, Xiaofeng Xin, Bingcai Zhao, Zhenqi Chen","doi":"10.1109/ISSCC42614.2022.9731638","DOIUrl":"https://doi.org/10.1109/ISSCC42614.2022.9731638","url":null,"abstract":"Ultra-wideband (UWB) technology differentiates itself from other wireless connectivity techniques, such as WiFi and Bluetooth, by providing centimeter-level location accuracy due to its impulse-radio operation. This unique feature draws much interest in smartphones, smart homes, intelligent vehicles, AR/VR, and IoT applications since accurate ranging/positioning adds a new dimension to existing wireless communication functions. The recently released IEEE 802.15.4z enhances UWB PHYs to increase the integrity and accuracy of ranging measurement and specifies a security extension for secure ranging [1]. Not many prior works have reported standard-compliant system-level UWB solutions except that some building blocks were discussed, such as coherent transmitters [2], [3]. An integrated UWB transceiver was reported in [4], which contains one transmitter and one receiver. However, the 1T1R architecture must switch between antennas to enable a phase-difference-of-arrival (PDoA) measurement, a primary use case for smartphone applications. Additional switches bring more insertion loss at the RF front-end. 
Moreover, the ranging time increases since it must measure multiple times, introducing accumulated timing error that significantly degrades the positioning accuracy.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"28 1","pages":"396-398"},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86198847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 1-Tb Density 4b/Cell 3D-NAND Flash on 176-Tier Technology with 4-Independent Planes for Read using CMOS-Under-the-Array","authors":"Ted Pekny, Luyen Vu, Jeff Tsai, Dheeraj Srinivasan, E. Yu, J. Pabustan, Joe Xu, Srinivasarao Deshmukh, Kim-Fung Chan, Michael Piccardi, K. Xu, Guan Wang, K. Shakeri, Vipul Patel, T. Iwasaki, Tongji Wang, Padma Musunuri, Carl Gu, A. Mohammadzadeh, Ali Ghalam, V. Moschiano, T. Vali, Jae-Kwan Park, June Lee, R. Ghodsi","doi":"10.1109/ISSCC42614.2022.9731691","DOIUrl":"https://doi.org/10.1109/ISSCC42614.2022.9731691","url":null,"abstract":"This paper presents a 1Tb 4b/cell 3D-NAND-Flash memory on a 176-tier technology with a 14.7Gb/mm² bit density. The die is organized using a 4-plane architecture for multiplane operations with a 16KB page size. The 1×4 plane architecture improves both program and read throughput, without increasing the die size. Periphery circuitry and page buffers are placed under the array using 5th-generation CMOS under array (CuA) technology. To improve random read performance, a faster read is provided with a read concurrency feature: allowing four independent multiplane page read addresses. The 4b/cell capability is reached using negative voltage for an expanded window in the negative region and a positive SRC bias, both of which aid in extended reliability. The programming operation is based on a 16–16 programming algorithm. The I/O transfer speed is 1600MT/s in ONFI4.2. 
The 3D-NAND Flash technology has improved significantly in its performance and reliability, enabling the design of a high-density 4b/cell (QLC) device.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"348 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77322321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReckOn: A 28nm Sub-mm² Task-Agnostic Spiking Recurrent Neural Network Processor Enabling On-Chip Learning over Second-Long Timescales","authors":"C. Frenkel, G. Indiveri","doi":"10.1109/ISSCC42614.2022.9731734","DOIUrl":"https://doi.org/10.1109/ISSCC42614.2022.9731734","url":null,"abstract":"The robustness of autonomous inference-only devices deployed in the real world is limited by data distribution changes induced by different users, environments, and task requirements. This challenge calls for the development of edge devices with an always-on adaptation to their target ecosystems. However, the memory requirements of conventional neural-network training algorithms scale with the temporal depth of the data being processed, which is not compatible with the constrained power and area budgets at the edge. For this reason, previous works demonstrating end-to-end on-chip learning without external memory were restricted to the processing of static data such as images [1]–[4], or to instantaneous decisions involving no memory of the past, e.g. obstacle avoidance in mobile robots [5]. The ability to learn short-to-long-term temporal dependencies on-chip is a missing enabler for robust autonomous edge devices in applications such as gesture recognition, speech processing, and cognitive robotics.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"107 1","pages":"1-3"},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86987488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Cryo-CMOS Low-Power Semi-Autonomous Qubit State Controller in 14nm FinFET Technology","authors":"D. Frank, S. Chakraborty, K. Tien, Pat Rosno, T. Fox, M. Yeck, J. Glick, R. Robertazzi, R. Richetta, J. Bulzacchelli, Daniel Ramirez, Dereje Yilma, Andrew Davies, R. Joshi, Shawn D. Chambers, S. Lekuch, K. Inoue, D. Underwood, Dorothy Wisnieff, C. Baks, D. Bethune, John Timmerwilke, B. Johnson, Brian P. Gaucher, D. Friedman","doi":"10.1109/ISSCC42614.2022.9731538","DOIUrl":"https://doi.org/10.1109/ISSCC42614.2022.9731538","url":null,"abstract":"Error-corrected quantum computing is expected to require at least 10⁵ to 10⁶ physical qubits. Superconducting transmons, which are promising qubit candidates for scaled quantum computing systems, typically require individually tailored RF pulses in the 4-to-6 GHz range to manipulate their states, so scaling to 10⁶ qubits presents an enormous challenge. Providing a control line for every qubit from room temperature (RT) to the 10mK environment does not appear to be viable for a 10⁶-qubit system due to multiple factors, including RF loss, mechanical congestion, heat load, and connector unreliability. TDM cannot be used to reduce the number of control lines since all of the qubits may need to be activated at once (e.g., during quantum error correction (QEC) cycles). 
FDM has been proposed but is undesirable because extra tones can give rise to unwanted qubit excitations.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"14 1","pages":"360-362"},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87147731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Mode 8K-MAC HW-Utilization-Aware Neural Processing Unit with a Unified Multi-Precision Datapath in 4nm Flagship Mobile SoC","authors":"Jun-Seok Park, Changsoo Park, S. Kwon, Hyeong-Seok Kim, Taeho Jeon, Yesung Kang, Heonsoo Lee, Dongwoo Lee, James Kim, YoungJong Lee, Sangkyu Park, Jun-Woo Jang, Sanghyuck Ha, MinSeong Kim, Jihoon Bang, Sukhwan Lim, Inyup Kang","doi":"10.1109/ISSCC42614.2022.9731639","DOIUrl":"https://doi.org/10.1109/ISSCC42614.2022.9731639","url":null,"abstract":"Recent work on neural-network accelerators has focused on obtaining high performance in order to meet the needs of real-time applications with vastly different performance requirements, including high precision computation, efficiency for various Deep Learning (DL) layer types, and extremely low power to run always-on applications. Applying a single mode or datatype uniformly across these different scenarios would be less efficient than using different operating modes according to different operating scenarios. For example, super-resolution typically requires FP16 precision for higher image quality, while NNs for face-detection need only INT4 or INT8 precision. Using higher precision than INT8 for face detection would result in higher power consumption. A highly programmable NPU capable of covering the diverse workloads observed in the real world is therefore desired. In this paper, we present a neural processing unit (NPU) optimized with the following features: i) reconfigurable data prefetching and operational flow for high compute utilization, ii) multi-precision MACs supporting INT4,8,16, and float16, iii) a dynamic operation mode to cover extremely low-power or low-latency requirements. 
These features provide the flexibility needed by real-world applications within the power constraints of various product domains.","PeriodicalId":6830,"journal":{"name":"2022 IEEE International Solid- State Circuits Conference (ISSCC)","volume":"9 1","pages":"246-248"},"PeriodicalIF":0.0,"publicationDate":"2022-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87233334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}