SP-PIM: A 22.41TFLOPS/W, 8.81Epochs/Sec Super-Pipelined Processing-In-Memory Accelerator with Local Error Prediction for On-Device Learning

Jung-Hoon Kim, Jaehoon Heo, Won-Ok Han, Jaeuk Kim, Joo-Young Kim

2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), June 11, 2023
DOI: 10.23919/VLSITechnologyandCir57934.2023.10185428
Abstract
This paper presents SP-PIM, a super-pipelined processing-in-memory accelerator that demonstrates real-time on-device learning through a holistic, multi-level pipelining scheme enabled by local error prediction. A local error prediction unit makes the training algorithm pipelineable while reducing computation overhead and overall external memory access, using power-of-two arithmetic operations and random weights. A double-buffered PIM macro performs both forward propagation and gradient calculation, and dual-sparsity-aware circuits exploit sparsity in both activations and errors. Finally, the 5.76 mm² SP-PIM chip, fabricated in a 28 nm process, achieves on-chip model training at 8.81 epochs/s with state-of-the-art area efficiency of 560.6 GFLOPS/mm² and power efficiency of 22.4 TFLOPS/W.
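The abstract's local error prediction with power-of-two arithmetic and random weights can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual design: it assumes a direct-feedback-alignment-style scheme in which a fixed random matrix (here `B`, with hypothetical dimensions) projects the output error to a layer-local error estimate, and its entries are quantized to signed powers of two so that each multiply reduces to a bit shift in hardware.

```python
import numpy as np

def quantize_pow2(w):
    """Round each weight to the nearest signed power of two.
    In hardware, multiplying by such a weight is a bit shift plus a sign flip,
    which is far cheaper than a full multiplier."""
    sign = np.sign(w)
    mag = np.abs(w)
    # Round the exponent; guard against log2(0) with a tiny floor, then zero it out.
    exp = np.round(np.log2(np.where(mag > 0, mag, 1e-12)))
    return sign * np.exp2(exp) * (mag > 0)

rng = np.random.default_rng(0)
# Fixed random feedback matrix (hypothetical shapes: 128 hidden units, 10 classes),
# quantized once to powers of two and never trained.
B = quantize_pow2(rng.standard_normal((128, 10)) * 0.1)

def predict_local_error(output_error):
    """Project the output error through the fixed random matrix, producing a
    layer-local error estimate without waiting for true backpropagation —
    this is what would let the pipeline stages overlap."""
    return B @ output_error
```

Because `B` is fixed and power-of-two valued, the local error is available as soon as the output error is, which is the property that would allow forward and backward work to be pipelined rather than serialized.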