Authors: Mohammed Elbtity, Peyton Chandarana, Ramtin Zand
arXiv ID: 2407.08700
Venue: arXiv - CS - Performance
Published: 2024-07-11
Citations: 0
Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
Tensor processing units (TPUs) are among the best-known machine
learning (ML) accelerators, deployed at large scale in data centers as well as
in tiny ML applications. TPUs offer several improvements and advantages over
conventional ML accelerators, such as graphics processing units (GPUs), because they are
designed specifically to perform the multiply-accumulate (MAC) operations
required by the matrix-matrix and matrix-vector multiplications that recur
throughout the execution of deep neural networks (DNNs). Such improvements
include maximizing data reuse and minimizing data transfer by leveraging the
temporal dataflow paradigms provided by the systolic array architecture. While
this design provides a significant performance benefit, current
implementations are restricted to a single fixed dataflow: input-stationary,
output-stationary, or weight-stationary. This can limit the achievable
performance of DNN inference and reduce the utilization of compute units.
Therefore, this work develops a reconfigurable-dataflow
TPU, called the Flex-TPU, which can dynamically change the dataflow per layer
at run time. Our experiments test the viability of the Flex-TPU
by comparing it to conventional TPU designs across multiple well-known ML
workloads. The results show that the Flex-TPU design achieves a
performance increase of up to 2.75x compared to a conventional TPU, with only
minor area and power overheads.
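
To make the idea of per-layer dataflow selection more concrete, the following is a minimal software sketch, not the Flex-TPU hardware design. It first shows that output-stationary and weight-stationary orderings are two loop orders over the same MAC operations, and then picks the cheapest dataflow for each layer from a table of cycle estimates. All function names and the dataflow labels "IS", "OS", and "WS" are illustrative assumptions, and the cycle counts are made-up placeholders rather than results from the paper.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul_output_stationary(a: Matrix, b: Matrix) -> Matrix:
    """Loop order in which each partial sum stays put until it is complete."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0  # the output element is the "stationary" operand
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply-accumulate (MAC)
            out[i][j] = acc
    return out


def matmul_weight_stationary(a: Matrix, b: Matrix) -> Matrix:
    """Loop order in which each weight is read once and reused for every input row."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for p in range(k):
        for j in range(n):
            w = b[p][j]  # the weight is the "stationary" operand
            for i in range(m):
                out[i][j] += a[i][p] * w  # same MACs, different order
    return out


def pick_dataflow_per_layer(layer_cycles: List[Dict[str, int]]) -> List[str]:
    """For each layer, choose the dataflow with the lowest estimated cycle count."""
    return [min(cycles, key=cycles.get) for cycles in layer_cycles]


def total_cycles(layer_cycles: List[Dict[str, int]], choices: List[str]) -> int:
    """Sum the cycle estimates for the dataflow chosen at each layer."""
    return sum(cycles[d] for cycles, d in zip(layer_cycles, choices))


# Hypothetical per-layer cycle estimates (placeholders, NOT measured data).
EXAMPLE_LAYERS = [
    {"IS": 120, "OS": 100, "WS": 150},
    {"IS": 90, "OS": 140, "WS": 80},
    {"IS": 60, "OS": 75, "WS": 95},
]

choices = pick_dataflow_per_layer(EXAMPLE_LAYERS)
flexible = total_cycles(EXAMPLE_LAYERS, choices)
# Best result achievable when one dataflow is fixed for the whole network.
best_fixed = min(
    total_cycles(EXAMPLE_LAYERS, [d] * len(EXAMPLE_LAYERS)) for d in ("IS", "OS", "WS")
)

print("per-layer choices:", choices)
print("flexible cycles:", flexible, "best fixed cycles:", best_fixed)
```

Because the per-layer choice can always fall back to whichever dataflow a fixed design would use, the flexible total in this toy model is never worse than the best fixed total. The sketch ignores the reconfiguration cost and the area and power overheads that a real design has to account for.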