{"title":"A 4-way Matrix Multiply Unit for High Throughput Machine Learning Accelerator","authors":"Seung Chan Lee, T. Han","doi":"10.1109/ISOCC47750.2019.9078493","DOIUrl":null,"url":null,"abstract":"With the rapid growth of modern applications based on machine learning, neural network (NN) algorithm has been widely used in various fields. Accordingly, machine learning accelerators with high performance based on FPGA and ASIC design have become necessary. Machine learning accelerators generally include a matrix multiply unit that performs arithmetic. However, despite the development of dedicated hardware, some NN algorithms still suffer from performance degradation due to computation bounds in the matrix multiply units. Resolving the computation bound is crucial for high throughput machine learning accelerator. In this paper, we propose a 4-way matrix unit to resolve the computation bound by minimizing idle state operation logic and improving overall utilization. A 4-way matrix multiply unit resulted in an average throughput improvement of 29 percent and a 24 percent increase in the total area, comparing to the conventional systolic array-based matrix multiply unit.","PeriodicalId":113802,"journal":{"name":"2019 International SoC Design Conference (ISOCC)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International SoC Design Conference (ISOCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISOCC47750.2019.9078493","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A 4-way Matrix Multiply Unit for High Throughput Machine Learning Accelerator
With the rapid growth of modern applications based on machine learning, neural network (NN) algorithms have come into wide use in various fields. Accordingly, high-performance machine learning accelerators based on FPGA and ASIC designs have become necessary. Machine learning accelerators generally include a matrix multiply unit that performs the arithmetic. However, despite the development of dedicated hardware, some NN algorithms still suffer from performance degradation because they are compute-bound in the matrix multiply unit. Resolving this computation bound is crucial for high-throughput machine learning accelerators. In this paper, we propose a 4-way matrix multiply unit that resolves the computation bound by minimizing idle operation logic and improving overall utilization. Compared with a conventional systolic array-based matrix multiply unit, the 4-way matrix multiply unit improves throughput by 29 percent on average at the cost of a 24 percent increase in total area.
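To put the two reported figures side by side, the following minimal sketch computes throughput per unit area relative to the baseline. It assumes that the 29 percent average throughput gain and the 24 percent area increase describe the same design point, and that total area is the only cost considered; the function name `relative_throughput_per_area` is hypothetical and does not come from the paper.

```python
def relative_throughput_per_area(throughput_gain: float, area_increase: float) -> float:
    """Throughput per unit area of the new design relative to the baseline.

    Both arguments are fractional changes (0.29 means +29 percent).
    A result of 1.0 means area efficiency is unchanged.
    """
    return (1.0 + throughput_gain) / (1.0 + area_increase)


# Figures reported in the abstract: +29% average throughput, +24% total area.
# Assumption: both numbers refer to the same configuration.
ratio = relative_throughput_per_area(0.29, 0.24)
print(f"{ratio:.3f}")  # prints 1.040
```

Under these assumptions the design delivers roughly 4 percent more throughput per unit area than the baseline, so the added area is slightly more than paid back in throughput.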