Connection of Diagonal Hessian Estimates to Natural Gradients in Stochastic Optimization

2021 55th Annual Conference on Information Sciences and Systems (CISS). Pub Date: 2021-03-24. DOI: 10.1109/CISS50987.2021.9400243

Shiqing Sun, J. Spall

Citations: 1

Abstract

With the massive resurgence of artificial intelligence, statistical learning theory and information science, the core technologies of AI, are receiving growing attention. Handling massive data requires efficient learning algorithms. In deep learning, natural gradient algorithms, such as AdaGrad and Adam, are widely used; they are motivated by Newton's method, which uses second-order derivatives to rescale gradients. By approximating the second-order geometry of the empirical loss with the empirical Fisher information matrix (FIM), natural gradient methods are expected to learn more efficiently. However, the exact curvature of the empirical loss is described by the Hessian matrix, not the FIM, and a bias between the empirical FIM and the Hessian always exists before convergence, which affects the expected efficiency. In this paper, we present a new stochastic optimization algorithm, diagSG (diagonal Hessian stochastic gradient), in the setting of deep learning. As a second-order algorithm, diagSG estimates the diagonal entries of the Hessian matrix at each iteration through simultaneous perturbation stochastic approximation (SPSA) and uses those entries to set the adaptive learning rate. By comparing the rescaling matrices in diagSG and in natural gradient methods, we argue that diagSG has an advantage in characterizing loss curvature because it approximates the Hessian diagonal more closely. We also provide an experiment to support this argument.
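The abstract does not spell out the diagSG update, so the following is only a minimal sketch of the general idea it describes: estimate the Hessian diagonal with a simultaneous perturbation of the gradient, then use that diagonal to rescale the step. The function names `sp_diag_hessian` and `diag_scaled_sgd`, the Rademacher perturbation, the running average of the estimates, and the `abs(...) + eps` safeguard are all assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np


def sp_diag_hessian(grad_fn, theta, c=1e-3, rng=None):
    """One simultaneous-perturbation estimate of the Hessian diagonal.

    Assumed form: with a random +/-1 vector delta,
    g(theta + c*delta) - g(theta - c*delta) ~= 2c * H @ delta,
    and E[delta * (H @ delta)] equals diag(H).
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    dg = grad_fn(theta + c * delta) - grad_fn(theta - c * delta)
    return delta * dg / (2.0 * c)


def diag_scaled_sgd(grad_fn, theta0, steps=200, lr=0.5, c=1e-3, eps=1e-8, seed=0):
    """Gradient descent rescaled by an averaged diagonal Hessian estimate.

    Illustrative only: the averaging scheme and the safeguard are assumptions.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    h_bar = np.zeros_like(theta)
    for k in range(1, steps + 1):
        h_hat = sp_diag_hessian(grad_fn, theta, c, rng)
        h_bar += (h_hat - h_bar) / k  # running mean of per-step estimates
        # Divide each gradient coordinate by its estimated curvature magnitude.
        theta -= lr * grad_fn(theta) / (np.abs(h_bar) + eps)
    return theta, h_bar


# Usage on a hypothetical quadratic loss 0.5 * theta^T A theta, whose gradient is A @ theta.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta_final, h_diag = diag_scaled_sgd(lambda th: A @ th, np.array([3.0, -2.0]))
print("estimated Hessian diagonal:", h_diag)
print("final iterate:", theta_final)
```

For a quadratic loss the gradient difference is exactly linear in the perturbation, so each estimate equals the true diagonal plus zero-mean noise from the off-diagonal entries; averaging over iterations reduces that noise. An empirical-FIM rescaling would instead use squared gradients, which is the contrast the abstract draws.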
