A high-performance tensor computing unit for deep learning acceleration

IF 7.1
Chip · Pub Date: 2025-03-28 · DOI: 10.1016/j.chip.2025.100145
Qiang Zhou, Tieli Sun, Taoran Shen, York Xue
{"title":"A high-performance tensor computing unit for deep learning acceleration","authors":"Qiang Zhou ,&nbsp;Tieli Sun ,&nbsp;Taoran Shen ,&nbsp;York Xue","doi":"10.1016/j.chip.2025.100145","DOIUrl":null,"url":null,"abstract":"<div><div>The increasing complexity of neural network applications has led to a demand for higher computational parallelism and more efficient synchronization in artificial intelligence (AI) chips. To achieve higher performance and lower power, a comprehensive and efficient approach is required to compile neural networks for implementation on dedicated hardware. Our first-generation deep learning accelerator, tensor computing unit, was presented with hardware and software solutions. It offered dedicated very long instruction words (VLIWs) instructions and multi-level repeatable direct memory access (DMA). The former lowers the instruction bandwidth requirement and makes it easier to parallelize the index and vector computations. The latter reduces the communication latency between the compute core and the asynchronous DMA, and also greatly alleviates the programming complexity. For operator implementation and optimization, the compiler-based data-flow generator and the instruction macro generator first produced a set of parameterized operators. Then, the tuner-configuration generator pruned the search space and the distributed tuner framework selected the best data-flow pattern and corresponding parameters. Our tensor computing unit supports all the convolution parameters with full-shape dimensions. It can readily select proper operators to achieve 96% of the chip peak performance under certain shapes and find the best performance implementation within limited power. The evaluation of a large number of convolution shapes on our tensor computing unit chip shows the generated operators significantly outperform the hand-written ones, achieving 9% higher normalized performance than CUDA according to the silicon data.</div></div>","PeriodicalId":100244,"journal":{"name":"Chip","volume":"4 2","pages":"Article 100145"},"PeriodicalIF":7.1000,"publicationDate":"2025-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Chip","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S270947232500019X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The increasing complexity of neural network applications has led to a demand for higher computational parallelism and more efficient synchronization in artificial intelligence (AI) chips. To achieve higher performance and lower power, a comprehensive and efficient approach is required to compile neural networks for implementation on dedicated hardware. Our first-generation deep learning accelerator, the tensor computing unit, is presented with both hardware and software solutions. It offers dedicated very long instruction word (VLIW) instructions and multi-level repeatable direct memory access (DMA). The former lowers the instruction bandwidth requirement and makes it easier to parallelize index and vector computations. The latter reduces the communication latency between the compute core and the asynchronous DMA and greatly alleviates programming complexity. For operator implementation and optimization, the compiler-based data-flow generator and the instruction macro generator first produce a set of parameterized operators. Then, the tuner-configuration generator prunes the search space, and the distributed tuner framework selects the best data-flow pattern and the corresponding parameters. Our tensor computing unit supports all convolution parameters across full-shape dimensions. It can readily select proper operators to achieve 96% of the chip's peak performance for certain shapes and find the best-performing implementation within a limited power budget. Evaluation of a large number of convolution shapes on our tensor computing unit chip shows that the generated operators significantly outperform hand-written ones, achieving 9% higher normalized performance than CUDA according to the silicon data.
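
The tuning flow summarized above (parameterized operator candidates, search-space pruning, and distributed selection of the best data-flow pattern) can be pictured with a minimal sketch. The structure below is an illustrative assumption only; the names (`Candidate`, `prune_search_space`, `measure_latency`) and the toy cost model are hypothetical and are not taken from the paper or its toolchain.

```python
# A minimal, hypothetical sketch of an operator-tuning flow: enumerate
# parameterized candidates, prune the search space with a simple rule,
# evaluate survivors in parallel, and keep the best data-flow pattern.
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    dataflow: str   # e.g. "weight-stationary" or "output-stationary"
    tile_m: int     # output-tile height
    tile_n: int     # output-tile width


def prune_search_space(cands, conv_shape):
    """Drop candidates whose tiles cannot fit the convolution output."""
    out_h, out_w = conv_shape
    return [c for c in cands if c.tile_m <= out_h and c.tile_n <= out_w]


def measure_latency(cand, conv_shape):
    """Stand-in cost model; a real tuner would time the operator on silicon."""
    out_h, out_w = conv_shape
    tiles = (out_h / cand.tile_m) * (out_w / cand.tile_n)
    overhead = 1.2 if cand.dataflow == "output-stationary" else 1.0
    return tiles * overhead


def tune(conv_shape=(56, 56)):
    candidates = [
        Candidate(df, m, n)
        for df, m, n in product(
            ("weight-stationary", "output-stationary"), (8, 16, 32), (8, 16, 32)
        )
    ]
    survivors = prune_search_space(candidates, conv_shape)
    # "Distributed" evaluation: each worker measures one candidate.
    with ProcessPoolExecutor() as pool:
        latencies = list(
            pool.map(measure_latency, survivors, [conv_shape] * len(survivors))
        )
    return min(zip(latencies, survivors), key=lambda t: t[0])


if __name__ == "__main__":
    latency, cfg = tune()
    print(f"best config: {cfg} (relative latency {latency:.2f})")
```

In the system described by the abstract, the measurement step would presumably run generated VLIW operators on the tensor computing unit itself, with the distributed tuner framework dispatching candidates across devices rather than local processes.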