TASO: Time and Space Optimization for Memory-Constrained DNN Inference

2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Pub Date: 2020-05-21
DOI: 10.1109/SBAC-PAD49847.2020.00036

Yuan Wen, Andrew Anderson, Valentin Radu, M. O’Boyle, David Gregg


Citations: 7

Abstract

Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory and energy budgets. We propose an approach for ahead-of-time, domain-specific optimization of CNN models, based on an integer linear programming (ILP) formulation for selecting primitive operations to implement convolutional layers. We optimize the trade-off between execution time and memory consumption by: 1) attempting to minimize execution time across the whole network by selecting data layouts and primitive operations to implement each layer; and 2) allocating an appropriate workspace that reflects the upper bound of the memory footprint per layer. These two optimization strategies can be used to run any CNN on any platform with a C compiler. Our evaluation with a range of popular ImageNet neural architectures (GoogleNet, AlexNet, VGG, ResNet and SqueezeNet) on the ARM Cortex-A15 yields speedups of 8× compared to greedy primitive selection, and reduces the memory requirement by 2.2× while sacrificing only 15% of inference time compared to a solver that considers inference time only. In addition, our optimization approach exposes a range of optimal points for different configurations across the Pareto frontier of the memory and latency trade-off, which can be used under arbitrary system constraints.
