{"title":"Analysis of Quantization Across DNN Accelerator Architecture Paradigms","authors":"Tom Glint, C. Jha, M. Awasthi, Joycee Mekie","doi":"10.23919/DATE56975.2023.10136899","DOIUrl":null,"url":null,"abstract":"Quantization techniques promise to significantly reduce the latency, energy, and area associated with multiplier hardware. This work, to the best of our knowledge, for the first time, shows the system-level impact of quantization on SOTA DNN accelerators from different digital accelerator paradigms. Based on the placement of data and compute site, we identify SOTA designs from Conventional Hardware Accelerators (CHA), Near Data Processors (NDP), and Processing-in-Memory (PIM) paradigms and show the impact of quantization when inferencing CNN and Fully Connected Layer (FCL) workloads. We show that the 32-bit implementation of SOTA from PIM consumes less energy than the 8-bit implementation of SOTA from CHA for FCL, while the trend reverses for CNN workloads. Further, PIM has stable latency while scaling the word size while CHA and NDP suffer 20% to $2\\times$ slow down for doubling word size.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/DATE56975.2023.10136899","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Quantization techniques promise to significantly reduce the latency, energy, and area associated with multiplier hardware. To the best of our knowledge, this work is the first to show the system-level impact of quantization on state-of-the-art (SOTA) DNN accelerators from different digital accelerator paradigms. Based on the placement of data and compute, we identify SOTA designs from the Conventional Hardware Accelerator (CHA), Near-Data Processor (NDP), and Processing-in-Memory (PIM) paradigms and show the impact of quantization when running inference on CNN and Fully Connected Layer (FCL) workloads. We show that for FCL workloads the 32-bit implementation of the SOTA PIM design consumes less energy than the 8-bit implementation of the SOTA CHA design, while the trend reverses for CNN workloads. Further, PIM latency remains stable as word size scales, whereas CHA and NDP suffer a 20% to 2× slowdown when word size doubles.