{"title":"A high-performance tensor computing unit for deep learning acceleration","authors":"Qiang Zhou , Tieli Sun , Taoran Shen , York Xue","doi":"10.1016/j.chip.2025.100145","DOIUrl":null,"url":null,"abstract":"<div><div>The increasing complexity of neural network applications has led to a demand for higher computational parallelism and more efficient synchronization in artificial intelligence (AI) chips. To achieve higher performance and lower power, a comprehensive and efficient approach is required to compile neural networks for implementation on dedicated hardware. Our first-generation deep learning accelerator, tensor computing unit, was presented with hardware and software solutions. It offered dedicated very long instruction words (VLIWs) instructions and multi-level repeatable direct memory access (DMA). The former lowers the instruction bandwidth requirement and makes it easier to parallelize the index and vector computations. The latter reduces the communication latency between the compute core and the asynchronous DMA, and also greatly alleviates the programming complexity. For operator implementation and optimization, the compiler-based data-flow generator and the instruction macro generator first produced a set of parameterized operators. Then, the tuner-configuration generator pruned the search space and the distributed tuner framework selected the best data-flow pattern and corresponding parameters. Our tensor computing unit supports all the convolution parameters with full-shape dimensions. It can readily select proper operators to achieve 96% of the chip peak performance under certain shapes and find the best performance implementation within limited power. The evaluation of a large number of convolution shapes on our tensor computing unit chip shows the generated operators significantly outperform the hand-written ones, achieving 9% higher normalized performance than CUDA according to the silicon data.</div></div>","PeriodicalId":100244,"journal":{"name":"Chip","volume":"4 2","pages":"Article 100145"},"PeriodicalIF":7.1000,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Chip","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S270947232500019X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
The increasing complexity of neural network applications has led to a demand for higher computational parallelism and more efficient synchronization in artificial intelligence (AI) chips. To achieve higher performance and lower power, a comprehensive and efficient approach is required to compile neural networks for implementation on dedicated hardware. We present our first-generation deep learning accelerator, the tensor computing unit, together with its hardware and software solutions. It offers dedicated very long instruction word (VLIW) instructions and multi-level repeatable direct memory access (DMA). The former lowers the instruction bandwidth requirement and makes it easier to parallelize the index and vector computations. The latter reduces the communication latency between the compute core and the asynchronous DMA, and also greatly alleviates the programming complexity. For operator implementation and optimization, the compiler-based data-flow generator and the instruction macro generator first produce a set of parameterized operators. Then, the tuner-configuration generator prunes the search space and the distributed tuner framework selects the best data-flow pattern and corresponding parameters. Our tensor computing unit supports all convolution parameters with full-shape dimensions. It can readily select the proper operators to achieve 96% of the chip's peak performance for certain shapes and find the best-performing implementation within a limited power budget. The evaluation of a large number of convolution shapes on our tensor computing unit chip shows that the generated operators significantly outperform the hand-written ones, achieving 9% higher normalized performance than CUDA according to the silicon data.
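The abstract describes a generate-prune-tune pipeline for operator implementation: a data-flow generator and instruction macro generator emit parameterized operator candidates, a tuner-configuration generator prunes the search space, and a distributed tuner picks the best data-flow pattern and parameters. Below is a minimal, purely illustrative sketch of that kind of flow in Python; the class and function names, tile sizes, memory budget, and cost model are assumptions for exposition and are not taken from the paper or its toolchain.

```python
# Hypothetical sketch of the generate -> prune -> tune flow summarized in the
# abstract. All names, tile sizes, and the cost model are illustrative only.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Candidate:
    dataflow: str   # data-flow pattern, e.g. "weight_stationary"
    tile_m: int     # output-channel tile size (assumed parameter)
    tile_k: int     # reduction-dimension tile size (assumed parameter)

def generate_candidates():
    """Stand-in for the data-flow generator + instruction macro generator:
    enumerate parameterized operator variants."""
    for dataflow, tm, tk in product(
        ("weight_stationary", "output_stationary"), (16, 32, 64), (8, 16, 32)
    ):
        yield Candidate(dataflow, tm, tk)

def prune(candidates, sram_bytes=256 * 1024):
    """Stand-in for the tuner-configuration generator: discard variants
    whose (assumed fp32) working set would not fit in on-chip memory."""
    return [c for c in candidates if c.tile_m * c.tile_k * 4 <= sram_bytes]

def measure(candidate, shape):
    """Placeholder cost model; on real hardware this would run the generated
    operator for the given convolution shape and read back performance counters."""
    n, c_in, h, w = shape
    reuse = candidate.tile_m * candidate.tile_k
    penalty = 1.5 if candidate.dataflow == "output_stationary" else 1.0
    return (n * c_in * h * w) / (reuse * penalty)   # lower modeled cost is better

def tune(shape):
    """Stand-in for the distributed tuner: pick the surviving candidate with
    the lowest modeled cost for one convolution shape."""
    return min(prune(generate_candidates()), key=lambda c: measure(c, shape))

if __name__ == "__main__":
    print(tune(shape=(1, 64, 56, 56)))   # example NCHW input shape
```

In the paper's described flow the tuner is distributed and measures real silicon rather than a model; this sketch serializes that search and only mirrors its structure.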