Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
Mohammed Elbtity, Peyton Chandarana, Ramtin Zand
arXiv - CS - Performance, published 2024-07-11
https://doi.org/arxiv-2407.08700
Abstract
Tensor processing units (TPUs) are among the best-known machine learning (ML)
accelerators, deployed at large scale in data centers as well as in tiny ML
applications. TPUs offer several advantages over conventional ML accelerators
such as graphics processing units (GPUs) because they are designed specifically
to perform the multiply-accumulate (MAC) operations required by the
matrix-matrix and matrix-vector multiplications that are extensively present
throughout the execution of deep neural networks (DNNs). Such improvements
include maximizing data reuse and minimizing data transfer by leveraging the
temporal dataflow paradigms provided by the systolic array architecture. While
this design provides a significant performance benefit, current
implementations are restricted to a single dataflow: input stationary,
output stationary, or weight stationary. A fixed dataflow can limit the
achievable performance of DNN inference and reduce the utilization of compute
units.
Therefore, this work develops a reconfigurable-dataflow TPU, called the
Flex-TPU, which can dynamically change the dataflow per layer at run time. Our
experiments test the viability of the Flex-TPU by comparing it to conventional
TPU designs across multiple well-known ML workloads. The results show that the
Flex-TPU achieves a significant performance increase of up to 2.75x compared
to a conventional TPU, with only minor area and power overheads.
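
The idea of choosing a dataflow per layer, rather than fixing one for the whole
network, can be illustrated with a minimal Python sketch. Everything in it is
an assumption made for illustration: the cycle counts in LAYER_CYCLES are
made-up numbers, not measurements from the paper, and the function names are
hypothetical. The sketch does not model the Flex-TPU hardware; it only shows why
a per-layer choice can never be slower than the best single fixed dataflow
when switching cost is ignored.

```python
from typing import Dict, List

# Input-, output-, and weight-stationary dataflows named in the abstract.
DATAFLOWS = ("IS", "OS", "WS")

# Hypothetical per-layer cycle counts for each dataflow.
# Illustrative numbers only; they are NOT results from the paper.
LAYER_CYCLES: List[Dict[str, int]] = [
    {"IS": 120, "OS": 90, "WS": 150},
    {"IS": 80, "OS": 110, "WS": 70},
    {"IS": 60, "OS": 95, "WS": 100},
]


def fixed_dataflow_cycles(layers: List[Dict[str, int]], dataflow: str) -> int:
    """Total cycles when every layer runs with the same dataflow."""
    return sum(layer[dataflow] for layer in layers)


def per_layer_schedule(layers: List[Dict[str, int]]) -> List[str]:
    """Pick, for each layer, the dataflow with the fewest cycles."""
    return [min(DATAFLOWS, key=lambda d: layer[d]) for layer in layers]


def flexible_cycles(layers: List[Dict[str, int]]) -> int:
    """Total cycles when the dataflow may change per layer.

    Assumes switching dataflows between layers costs zero cycles,
    which is a simplification made only for this sketch.
    """
    return sum(min(layer[d] for d in DATAFLOWS) for layer in layers)


if __name__ == "__main__":
    best_fixed = min(fixed_dataflow_cycles(LAYER_CYCLES, d) for d in DATAFLOWS)
    flexible = flexible_cycles(LAYER_CYCLES)
    print("per-layer schedule:", per_layer_schedule(LAYER_CYCLES))
    print("best fixed dataflow cycles:", best_fixed)
    print("flexible cycles:", flexible)
    print("speedup over best fixed: %.2fx" % (best_fixed / flexible))
```

With these made-up numbers, each layer prefers a different dataflow, so the
per-layer schedule beats every fixed choice. Real gains depend on layer shapes,
array size, and reconfiguration overhead, none of which this sketch models.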