SP-PIM: A 22.41TFLOPS/W, 8.81Epochs/Sec Super-Pipelined Processing-In-Memory Accelerator with Local Error Prediction for On-Device Learning

Jung-Hoon Kim, Jaehoon Heo, Won-Ok Han, Jaeuk Kim, Joo-Young Kim

2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), June 11, 2023
DOI: 10.23919/VLSITechnologyandCir57934.2023.10185428
This paper presents SP-PIM, a processing-in-memory accelerator that demonstrates real-time on-device learning through a holistic, multi-level pipelining scheme enabled by local error prediction. A local error prediction unit makes the training algorithm pipelineable while reducing computation overhead and overall external memory access by relying on power-of-two arithmetic operations and random weights. Its double-buffered PIM macro performs both forward propagation and gradient calculation, while dual-sparsity-aware circuits exploit sparsity in both activations and errors. Finally, the 5.76mm² SP-PIM chip, fabricated in a 28nm process, achieves on-chip model training at 8.81 Epochs/Sec with state-of-the-art area efficiency of 560.6GFLOPS/mm² and power efficiency of 22.4TFLOPS/W.