Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

arXiv - CS - Performance | Pub Date: 2024-07-11 | DOI: arxiv-2407.08700

Mohammed Elbtity, Peyton Chandarana, Ramtin Zand



Abstract

Tensor processing units (TPUs) are among the best-known machine learning (ML) accelerators, deployed at large scale in data centers as well as in tiny ML applications. Because they are designed specifically for the multiply-accumulate (MAC) operations behind the matrix-matrix and matrix-vector multiplications that dominate deep neural network (DNN) execution, TPUs offer several advantages over conventional ML accelerators such as graphics processing units (GPUs). These advantages include maximizing data reuse and minimizing data transfer by exploiting the temporal dataflow of the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow: input stationary, output stationary, or weight stationary. This restriction can limit the achievable performance of DNN inference and reduce the utilization of compute units. This work therefore develops a reconfigurable-dataflow TPU, called the Flex-TPU, which can change its dataflow dynamically, layer by layer, at run time. Our experiments test the viability of the Flex-TPU by comparing it with conventional TPU designs across multiple well-known ML workloads. The results show that the Flex-TPU achieves a performance increase of up to 2.75x over a conventional TPU, with only minor area and power overheads.
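To make the idea of per-layer dataflow selection concrete, the following is a minimal sketch, not the authors' implementation. It assumes each layer can be expressed as an (m x k) by (k x n) matrix multiply, and it uses a deliberately simplified, hypothetical cycle model in which each tile costs its streamed dimension plus a fixed fill/drain term. The function names, the 8x8 array size, and the two example layer shapes are all illustrative assumptions; the paper's actual cost model and hardware selection mechanism are not described in the abstract.

```python
from math import ceil


def estimate_cycles(m, k, n, rows, cols):
    """Toy cycle estimates for an (m x k) @ (k x n) multiply on a rows x cols array.

    Assumption: every tile costs the length of the streamed dimension plus a
    fixed pipeline fill/drain term. This is a simplification for illustration,
    not a model taken from the Flex-TPU paper.
    """
    fill = rows + cols - 1  # assumed fill/drain cost per tile
    return {
        # Outputs stay in the array; the reduction dimension k is streamed.
        "output_stationary": ceil(m / rows) * ceil(n / cols) * (k + fill),
        # Weights stay in the array; the m input rows are streamed.
        "weight_stationary": ceil(k / rows) * ceil(n / cols) * (m + fill),
        # Inputs stay in the array; the n weight columns are streamed.
        "input_stationary": ceil(k / rows) * ceil(m / cols) * (n + fill),
    }


def pick_dataflow(m, k, n, rows=8, cols=8):
    """Return the dataflow with the lowest estimated cycle count, and that count."""
    cycles = estimate_cycles(m, k, n, rows, cols)
    best = min(cycles, key=cycles.get)
    return best, cycles[best]


# Hypothetical layer shapes (name, m, k, n): one convolution-like, one fully connected.
layers = [("conv_like", 3136, 576, 64), ("fc_like", 1, 4096, 1000)]

# Per-layer schedule: each layer gets whichever dataflow the toy model favors.
schedule = {name: pick_dataflow(m, k, n) for name, m, k, n in layers}

for name, (dataflow, cycles) in schedule.items():
    print(f"{name}: {dataflow} ({cycles} estimated cycles)")
```

Under this toy model the two layers favor different dataflows, which is the situation a fixed-dataflow design cannot exploit and a per-layer reconfigurable design can.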

