Taming the Beast: Programming Peta-FLOP Class Deep Learning Systems

Swagath Venkataramani, V. Srinivasan, Jungwook Choi, K. Gopalakrishnan, Leland Chang
{"title":"驯服野兽:编程Peta-FLOP类深度学习系统","authors":"Swagath Venkataramani, V. Srinivasan, Jungwook Choi, K. Gopalakrishnan, Leland Chang","doi":"10.1145/3218603.3241338","DOIUrl":null,"url":null,"abstract":"1 EXTENDED ABSTRACT The field of Artificial Intelligence (AI) has witnessed quintessential growth in recent years with the advent of Deep Neural Networks (DNNs) that have achieved state-of-the-art performance on challenging cognitive tasks involving images, videos, text and natural language. They are being increasingly deployed in many real-world services and products, and have pervaded the spectrum of computing devices from mobile/IoT devices to server-class platforms. However, DNNs are highly compute and data intensive workloads, far outstripping the capabilities of today’s computing platforms. For example, state-of-the-art image recognition DNNs require billions of operations to classify a single image. On the other hand, training DNN models demands exa-flops of compute and uses massive datasets requiring 100s of giga-bytes of memory. One approach to address the computational challenges imposed by DNNs is through the design of hardware accelerators, whose compute cores, memory hierarchy and interconnect topology are specialized to match the DNN’s compute and communication characteristics. Several such designs ranging from low-power IP cores to largescale accelerator systems have been proposed in literature. Some factors that enable the design of specialized systems for DNNs are: (i) their computations can be expressed as static data-flow graphs, (ii) their computation patterns are regular with no data-dependent control flows and offer abundant opportunities for data-reuse, and (iii) their functionality could be encapsulated within a set of few (tens of) basic functions (e.g. convolution, matrix-multiplication etc.). That said, DNNs also exhibit abundant heterogeneity at various levels. Across layers, the number of input and output channels and the dimensions of each feature are substantially different. Further, each layer comprises of operations whose Bytes/FLOP requirement vary by over two orders of magnitude. The heterogeneity in compute characteristics engenders a wide range of possibilities to spatiotemporally map DNNs on accelerator platforms, defined in terms of how computations are split across the different compute elements in the architecture and how computations assigned to a compute element are temporally sequenced in time. We are therefore led to ask whether it is possible to come up with a systematic exploration of the design space of mapping configurations to maximize DNN’s performance on a given accelerator architecture using a variety of different dataflows? How will the computations be partitioned and sequenced across the processing","PeriodicalId":20456,"journal":{"name":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Taming the beast: Programming Peta-FLOP class Deep Learning Systems\",\"authors\":\"Swagath Venkataramani, V. Srinivasan, Jungwook Choi, K. 
Gopalakrishnan, Leland Chang\",\"doi\":\"10.1145/3218603.3241338\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"1 EXTENDED ABSTRACT The field of Artificial Intelligence (AI) has witnessed quintessential growth in recent years with the advent of Deep Neural Networks (DNNs) that have achieved state-of-the-art performance on challenging cognitive tasks involving images, videos, text and natural language. They are being increasingly deployed in many real-world services and products, and have pervaded the spectrum of computing devices from mobile/IoT devices to server-class platforms. However, DNNs are highly compute and data intensive workloads, far outstripping the capabilities of today’s computing platforms. For example, state-of-the-art image recognition DNNs require billions of operations to classify a single image. On the other hand, training DNN models demands exa-flops of compute and uses massive datasets requiring 100s of giga-bytes of memory. One approach to address the computational challenges imposed by DNNs is through the design of hardware accelerators, whose compute cores, memory hierarchy and interconnect topology are specialized to match the DNN’s compute and communication characteristics. Several such designs ranging from low-power IP cores to largescale accelerator systems have been proposed in literature. Some factors that enable the design of specialized systems for DNNs are: (i) their computations can be expressed as static data-flow graphs, (ii) their computation patterns are regular with no data-dependent control flows and offer abundant opportunities for data-reuse, and (iii) their functionality could be encapsulated within a set of few (tens of) basic functions (e.g. convolution, matrix-multiplication etc.). That said, DNNs also exhibit abundant heterogeneity at various levels. Across layers, the number of input and output channels and the dimensions of each feature are substantially different. Further, each layer comprises of operations whose Bytes/FLOP requirement vary by over two orders of magnitude. The heterogeneity in compute characteristics engenders a wide range of possibilities to spatiotemporally map DNNs on accelerator platforms, defined in terms of how computations are split across the different compute elements in the architecture and how computations assigned to a compute element are temporally sequenced in time. We are therefore led to ask whether it is possible to come up with a systematic exploration of the design space of mapping configurations to maximize DNN’s performance on a given accelerator architecture using a variety of different dataflows? 
How will the computations be partitioned and sequenced across the processing\",\"PeriodicalId\":20456,\"journal\":{\"name\":\"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3218603.3241338\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3218603.3241338","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The field of Artificial Intelligence (AI) has witnessed tremendous growth in recent years with the advent of Deep Neural Networks (DNNs), which have achieved state-of-the-art performance on challenging cognitive tasks involving images, videos, text and natural language. They are being increasingly deployed in many real-world services and products, and have pervaded the spectrum of computing devices, from mobile/IoT devices to server-class platforms. However, DNNs are highly compute- and data-intensive workloads, far outstripping the capabilities of today's computing platforms. For example, state-of-the-art image-recognition DNNs require billions of operations to classify a single image, while training a DNN model demands exa-FLOPs of compute and massive datasets occupying hundreds of gigabytes of memory.

One approach to addressing the computational challenges posed by DNNs is to design hardware accelerators whose compute cores, memory hierarchy and interconnect topology are specialized to match the DNN's compute and communication characteristics. Several such designs, ranging from low-power IP cores to large-scale accelerator systems, have been proposed in the literature. Among the factors that enable the design of specialized systems for DNNs are: (i) their computations can be expressed as static dataflow graphs; (ii) their computation patterns are regular, with no data-dependent control flow, and offer abundant opportunities for data reuse; and (iii) their functionality can be encapsulated within a small set (a few tens) of basic functions, e.g. convolution and matrix multiplication.

That said, DNNs also exhibit abundant heterogeneity at various levels. Across layers, the number of input and output channels and the dimensions of each feature map differ substantially. Further, each layer comprises operations whose Bytes/FLOP requirements vary by over two orders of magnitude. This heterogeneity in compute characteristics opens up a wide range of possibilities for spatiotemporally mapping DNNs onto accelerator platforms, defined in terms of how computations are split across the different compute elements of the architecture and how the computations assigned to each compute element are sequenced in time. We are therefore led to ask: is it possible to systematically explore the design space of mapping configurations, across a variety of dataflows, so as to maximize a DNN's performance on a given accelerator architecture? How will the computations be partitioned and sequenced across the processing elements?
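Points (i)-(iii) above are what make static compilation and mapping tractable. As a minimal illustration (our own sketch; the paper defines no such API, and `Node`, `PRIMITIVES` and `node` are hypothetical names), a DNN can be captured as a static dataflow graph whose nodes are drawn from a small vocabulary of primitives:

```python
# A minimal sketch of properties (i)-(iii): the whole network is a static
# graph over a handful of primitive operations, fully known before
# execution, with no data-dependent control flow.
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                # one of a few basic functions
    inputs: list = field(default_factory=list)

PRIMITIVES = {"conv2d", "matmul", "relu", "pool", "add"}

def node(op, *inputs):
    assert op in PRIMITIVES, f"unknown primitive: {op}"
    return Node(op, list(inputs))

# A tiny residual block. Because the graph is fixed ahead of time,
# a mapper can schedule and partition it entirely statically.
x = node("conv2d")                         # stem (inputs elided)
y = node("relu", node("conv2d", x))
out = node("relu", node("add", node("conv2d", y), x))
```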
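The claim that Bytes/FLOP requirements vary by over two orders of magnitude is easy to check with back-of-the-envelope arithmetic. The sketch below is our illustration, not the paper's data; the ResNet-like layer shapes and 16-bit operands are assumptions chosen for the example:

```python
# Minimum off-chip traffic per FLOP, assuming ideal on-chip reuse
# (every tensor is moved exactly once). Shapes are illustrative.
BYTES = 2  # assume 16-bit operands

def conv_bytes_per_flop(n, c, h, w, k, r, s):
    """Stride-1 convolution: batch n, c -> k channels, h x w input, r x s kernel."""
    p, q = h - r + 1, w - s + 1                    # output spatial dims
    flops = 2 * n * k * c * r * s * p * q          # multiply + accumulate
    traffic = BYTES * (c * k * r * s               # weights
                       + n * c * h * w             # input activations
                       + n * k * p * q)            # output activations
    return traffic / flops

def fc_bytes_per_flop(n, cin, cout):
    """Fully-connected layer: at batch size 1, each weight is used once."""
    flops = 2 * n * cin * cout
    traffic = BYTES * (cin * cout + n * cin + n * cout)
    return traffic / flops

conv = conv_bytes_per_flop(n=1, c=64, h=56, w=56, k=64, r=3, s=3)
fc = fc_bytes_per_flop(n=1, cin=2048, cout=1000)
print(f"conv: {conv:.4f} B/FLOP, fc: {fc:.2f} B/FLOP, ratio: {fc / conv:.0f}x")
# -> roughly 0.004 vs 1.0 B/FLOP: about 250x, over two orders of magnitude
```

The convolution reuses each weight thousands of times across output pixels, while the classifier layer at batch size 1 touches each weight exactly once; this is precisely the heterogeneity an accelerator mapping must accommodate.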
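Finally, the mapping question itself can be made concrete with a toy explorer. The sketch below is again our own simplification, not the authors' framework (`utilization` and `best_mapping` are hypothetical helpers): it spatially splits one convolution's output channels and output pixels across a grid of processing elements (PEs) and keeps the split with the highest utilization. A real explorer would also sweep temporal loop orders and memory tilings.

```python
# Toy spatial-mapping explorer: partition K output channels and P*Q output
# pixels of one conv layer across num_pes processing elements.
import math

def utilization(k, pq, k_split, pq_split, num_pes):
    """Fraction of the whole array's cycles spent on useful work for a
    (k_split x pq_split) partition; oversubscribed splits are invalid."""
    if k_split * pq_split > num_pes:
        return 0.0
    # Each PE sequentially processes one tile; ragged tiles make some
    # PEs finish early, and unallocated PEs sit idle throughout.
    steps = math.ceil(k / k_split) * math.ceil(pq / pq_split)
    return (k * pq) / (steps * num_pes)

def best_mapping(k, pq, num_pes):
    """Exhaustively score every spatial split and return the best one."""
    return max((utilization(k, pq, ks, ps, num_pes), ks, ps)
               for ks in range(1, num_pes + 1)
               for ps in range(1, num_pes + 1))

util, ks, ps = best_mapping(k=64, pq=54 * 54, num_pes=256)
print(f"best split: K/{ks} x pixels/{ps}, utilization {util:.1%}")
```

Even this toy version shows why the space is large: one layer on a 256-PE array already yields tens of thousands of candidate splits, and a full mapping multiplies that by the temporal sequencing choices of every layer.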