A Performance Improvement Approach for Second-Order Optimization in Large Mini-batch Training

Hiroki Naganuma, Rio Yokota
{"title":"A Performance Improvement Approach for Second-Order Optimization in Large Mini-batch Training","authors":"Hiroki Naganuma, Rio Yokota","doi":"10.1109/CCGRID.2019.00092","DOIUrl":null,"url":null,"abstract":"Classical learning theory states that when the number of parameters of the model is too large compared to the data, the model will overfit and the generalization performance deteriorates. However, it has been empirically shown that deep neural networks (DNN) can achieve high generalization capability by training with extremely large amount of data and model parameters, which exceeds the predictions of classical learning theory. One drawback of this is that training of DNN requires enormous calculation time. Therefore, it is necessary to reduce the training time through large scale parallelization. Straightforward data-parallelization of DNN degrades convergence and generalization. In the present work, we investigate the possibility of using second order methods to solve this generalization gap in large-batch training. This is motivated by our observation that each mini-batch becomes more statistically stable, and thus the effect of considering the curvature plays a more important role in large-batch training. We have also found that naively adapting the natural gradient method causes the generalization performance to deteriorate further due to the lack of regularization capability. We propose an improved second order method by smoothing the loss function, which allows second-order methods to generalize as well as mini-batch SGD.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2019.00092","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Classical learning theory states that when the number of model parameters is too large relative to the amount of data, the model will overfit and generalization performance deteriorates. However, it has been empirically shown that deep neural networks (DNNs) can achieve high generalization capability when trained with extremely large amounts of data and model parameters, which exceeds the predictions of classical learning theory. One drawback is that training DNNs requires enormous computation time, so it is necessary to reduce training time through large-scale parallelization. Straightforward data-parallel training of DNNs degrades convergence and generalization. In the present work, we investigate the possibility of using second-order methods to close this generalization gap in large-batch training. This is motivated by our observation that each mini-batch becomes more statistically stable as the batch size grows, so accounting for curvature plays a more important role in large-batch training. We have also found that naively applying the natural gradient method causes generalization performance to deteriorate further due to its lack of regularization capability. We propose an improved second-order method that smooths the loss function, which allows second-order methods to generalize as well as mini-batch SGD.
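To make the idea concrete, the sketch below illustrates a natural-gradient-style update applied to a smoothed loss. It is a minimal, hypothetical example on a toy logistic-regression problem: smoothing is illustrated by averaging gradients over small Gaussian weight perturbations, and the Fisher matrix is approximated by the empirical Fisher with damping. The paper's actual smoothing technique, Fisher approximation, and hyperparameters may differ.

```python
# Hypothetical sketch: natural-gradient descent on a smoothed loss.
# Smoothing here = averaging gradients over Gaussian weight perturbations;
# this is an illustrative stand-in, not the paper's exact method.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data (stand-in for a large mini-batch).
n, d = 512, 10
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smoothed_grad_and_fisher(w, X, y, sigma=0.01, n_samples=4):
    """Gradient and empirical Fisher of a smoothed logistic loss.

    Smoothing (illustrative only): average over Gaussian perturbations
    of the weights, w + sigma * eps.
    """
    d = w.shape[0]
    grad = np.zeros(d)
    fisher = np.zeros((d, d))
    for _ in range(n_samples):
        w_pert = w + sigma * rng.normal(size=d)
        p = sigmoid(X @ w_pert)
        # Per-example gradients of the negative log-likelihood.
        g_per_example = (p - y)[:, None] * X          # shape (n, d)
        grad += g_per_example.mean(axis=0)
        fisher += g_per_example.T @ g_per_example / len(y)
    return grad / n_samples, fisher / n_samples

# Natural-gradient descent on the smoothed loss.
w = np.zeros(d)
lr, damping = 0.5, 1e-3
for step in range(100):
    g, F = smoothed_grad_and_fisher(w, X, y)
    # Damped Fisher inverse, as is standard for natural-gradient methods.
    w -= lr * np.linalg.solve(F + damping * np.eye(d), g)

acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"training accuracy after natural-gradient steps: {acc:.3f}")
```

The damping term plays the same stabilizing role it does in practical natural-gradient and K-FAC implementations; the perturbation-averaging step is what distinguishes the smoothed update from a plain natural-gradient step.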