{"title":"基于端到端训练的紧凑神经网络:一种自动排序的贝叶斯张量方法","authors":"Cole Hawkins, Xing-er Liu, Zheng Zhang","doi":"10.1137/21m1391444","DOIUrl":null,"url":null,"abstract":"While post-training model compression can greatly reduce the inference cost of a deep neural network, uncompressed training still consumes a huge amount of hardware resources, run-time and energy. It is highly desirable to directly train a compact neural network from scratch with low memory and low computational cost. Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks. However, directly training a low-rank tensorized neural network is a very challenging task because it is hard to determine a proper tensor rank {\\it a priori}, which controls the model complexity and compression ratio in the training process. This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks. We first develop a flexible Bayesian model that can handle various low-rank tensor formats (e.g., CP, Tucker, tensor train and tensor-train matrix) that compress neural network parameters in training. This model can automatically determine the tensor ranks inside a nonlinear forward model, which is beyond the capability of existing Bayesian tensor methods. We further develop a scalable stochastic variational inference solver to estimate the posterior density of large-scale problems in training. Our work provides the first general-purpose rank-adaptive framework for end-to-end tensorized training. Our numerical results on various neural network architectures show orders-of-magnitude parameter reduction and little accuracy loss (or even better accuracy) in the training process. Specifically, on a very large deep learning recommendation system with over $4.2\\times 10^9$ model parameters, our method can reduce the variables to only $1.6\\times 10^5$ automatically in the training process (i.e., by $2.6\\times 10^4$ times) while achieving almost the same accuracy.","PeriodicalId":74797,"journal":{"name":"SIAM journal on mathematics of data science","volume":"45 1","pages":"46-71"},"PeriodicalIF":1.9000,"publicationDate":"2020-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":"{\"title\":\"Towards Compact Neural Networks via End-to-End Training: A Bayesian Tensor Approach with Automatic Rank Determination\",\"authors\":\"Cole Hawkins, Xing-er Liu, Zheng Zhang\",\"doi\":\"10.1137/21m1391444\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While post-training model compression can greatly reduce the inference cost of a deep neural network, uncompressed training still consumes a huge amount of hardware resources, run-time and energy. It is highly desirable to directly train a compact neural network from scratch with low memory and low computational cost. Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-size neural networks. However, directly training a low-rank tensorized neural network is a very challenging task because it is hard to determine a proper tensor rank {\\\\it a priori}, which controls the model complexity and compression ratio in the training process. This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks. 
We first develop a flexible Bayesian model that can handle various low-rank tensor formats (e.g., CP, Tucker, tensor train and tensor-train matrix) that compress neural network parameters in training. This model can automatically determine the tensor ranks inside a nonlinear forward model, which is beyond the capability of existing Bayesian tensor methods. We further develop a scalable stochastic variational inference solver to estimate the posterior density of large-scale problems in training. Our work provides the first general-purpose rank-adaptive framework for end-to-end tensorized training. Our numerical results on various neural network architectures show orders-of-magnitude parameter reduction and little accuracy loss (or even better accuracy) in the training process. Specifically, on a very large deep learning recommendation system with over $4.2\\\\times 10^9$ model parameters, our method can reduce the variables to only $1.6\\\\times 10^5$ automatically in the training process (i.e., by $2.6\\\\times 10^4$ times) while achieving almost the same accuracy.\",\"PeriodicalId\":74797,\"journal\":{\"name\":\"SIAM journal on mathematics of data science\",\"volume\":\"45 1\",\"pages\":\"46-71\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2020-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"17\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"SIAM journal on mathematics of data science\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1137/21m1391444\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICS, APPLIED\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIAM journal on mathematics of data science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1137/21m1391444","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICS, APPLIED","Score":null,"Total":0}
Citations: 17
Abstract
While post-training model compression can greatly reduce the inference cost of a deep neural network, uncompressed training still consumes huge amounts of hardware resources, runtime, and energy. It is therefore highly desirable to train a compact neural network from scratch, with low memory and low computational cost. Low-rank tensor decomposition is one of the most effective approaches to reduce the memory and computing requirements of large-scale neural networks. However, directly training a low-rank tensorized neural network is very challenging, because the tensor rank, which controls the model complexity and the compression ratio during training, is hard to determine a priori. This paper presents a novel end-to-end framework for low-rank tensorized training of neural networks. We first develop a flexible Bayesian model that can handle various low-rank tensor formats (e.g., CP, Tucker, tensor train, and tensor-train matrix) to compress neural network parameters during training. This model can automatically determine the tensor ranks inside a nonlinear forward model, which is beyond the capability of existing Bayesian tensor methods. We further develop a scalable stochastic variational inference solver to estimate the posterior density in large-scale training problems. Our work provides the first general-purpose rank-adaptive framework for end-to-end tensorized training. Numerical results on various neural network architectures show orders-of-magnitude parameter reduction with little accuracy loss (and sometimes even better accuracy) during training. Specifically, on a very large deep learning recommendation system with over $4.2\times 10^9$ model parameters, our method automatically reduces the number of training variables to only $1.6\times 10^5$ (i.e., a $2.6\times 10^4\times$ reduction) while achieving almost the same accuracy.
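To make the tensor-train matrix (TT-matrix) idea in the abstract concrete, below is a minimal illustrative sketch, not the authors' implementation. A large weight matrix is never stored densely; it is contracted on the fly from small cores, and one scale vector per internal TT rank (the name `rank_scales` is invented here) stands in for the Bayesian shrinkage variables that, in the paper, a prior plus stochastic variational inference would drive toward zero to prune unneeded ranks.

```python
# Hedged sketch of a TT-matrix linear layer (PyTorch). Assumed/illustrative
# names: TTLinear, rank_scales. The deterministic scales below are only a
# stand-in for the paper's Bayesian rank-shrinkage variables.
from math import prod

import torch
import torch.nn as nn


class TTLinear(nn.Module):
    """Linear layer whose weight is stored as TT-matrix cores.

    A (prod(in_modes) x prod(out_modes)) weight is represented by cores
    G_k of shape (r_{k-1}, in_modes[k], out_modes[k], r_k), with r_0 = r_d = 1.
    """

    def __init__(self, in_modes, out_modes, max_rank):
        super().__init__()
        assert len(in_modes) == len(out_modes)
        d = len(in_modes)
        ranks = [1] + [max_rank] * (d - 1) + [1]
        self.in_modes, self.out_modes = in_modes, out_modes
        self.cores = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k],
                                           out_modes[k], ranks[k + 1]))
            for k in range(d)
        ])
        # One nonnegative scale per slice of each internal TT rank. Under a
        # shrinkage prior (e.g., half-Cauchy or log-uniform; the paper uses a
        # Bayesian model fitted by variational inference), scales of unneeded
        # slices collapse to ~0 -- this is what automatic rank determination
        # amounts to in this parameterization.
        self.rank_scales = nn.ParameterList([
            nn.Parameter(torch.ones(ranks[k + 1])) for k in range(d - 1)
        ])

    def full_weight(self):
        W = self.cores[0]
        for k in range(1, len(self.cores)):
            # Gate the rank index shared by cores k-1 and k, then contract it.
            core = self.cores[k] * self.rank_scales[k - 1].view(-1, 1, 1, 1)
            W = torch.tensordot(W, core, dims=([W.dim() - 1], [0]))
        W = W.squeeze(0).squeeze(-1)        # drop boundary ranks r_0 = r_d = 1
        d = len(self.cores)                 # axes: (i_1, o_1, ..., i_d, o_d)
        perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
        return W.permute(perm).reshape(prod(self.in_modes), prod(self.out_modes))

    def forward(self, x):
        return x @ self.full_weight()


# A 256 x 1024 dense layer (262,144 weights) factorized as (4,8,8) x (8,8,16):
layer = TTLinear(in_modes=(4, 8, 8), out_modes=(8, 8, 16), max_rank=8)
y = layer(torch.randn(32, 256))                     # -> shape (32, 1024)
print(sum(p.numel() for p in layer.parameters()))   # 5,392 trainable variables
```

Even this toy configuration stores about 5.4e3 variables instead of 2.6e5 dense weights, which is the mechanism behind the paper's reported compression; the paper's actual contribution is treating the rank-controlling variables as random quantities with a shrinkage prior and estimating their posterior with a scalable stochastic variational inference solver, rather than hand-tuning `max_rank`.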