Vanishing Gradient Analysis in Stochastic Diagonal Approximate Greatest Descent Optimization

IF 1.1, Zone 4 (Computer Science), JCR Q4: COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Information Science and Engineering, Pub Date: 2020-09-01, DOI: 10.6688/JISE.202009_36(5).0005

H. Tan, K. Lim

Pages: 1007-1019

Citations: 0

Abstract

Deep learning neural networks tackle high-complexity classification problems by stacking multiple hidden layers between input and output. The measured error is backpropagated layer by layer, and the gradient gradually vanishes because of the derivative of the activation function applied at each layer. In this paper, Stochastic Diagonal Approximate Greatest Descent (SDAGD) is proposed to tackle the vanishing gradient issue in deep learning neural networks by using an adaptive step length derived from second-order derivative information. The trajectory of the proposed SDAGD optimizer is demonstrated on three-dimensional error surfaces, i.e., (a) a hilly error surface with two local minima and one global minimum; (b) a deep Gaussian trench that simulates the drastic gradient changes of ravine topography; and (c) a small initial gradient that simulates plateau terrain. On these surfaces, SDAGD converges to the global minimum at the fastest rate, without interference from the vanishing gradient issue, compared with benchmark optimizers such as Gradient Descent (GD), AdaGrad, and AdaDelta. Experiments are conducted on saturated and unsaturated activation functions with sequentially added hidden layers to evaluate how well the proposed optimizer mitigates the vanishing gradient. The experimental results show that SDAGD performs well in the tested deep feedforward networks, whereas stochastic GD yields a worse misclassification error once the network has more than three hidden layers because of the vanishing gradient issue. SDAGD mitigates the vanishing gradient by adaptively controlling the step-length element in each layer using second-order information. With a constant training-iteration setup, SDAGD with ReLU achieves the lowest misclassification rate, 1.77%, among the compared optimization methods.
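The abstract attributes the vanishing gradient to the derivative of the activation function being applied once per layer during backpropagation. The following is a minimal sketch of that effect only, not of SDAGD itself, whose update rule is not given here. It assumes all weights equal 1 and a single hypothetical pre-activation value per layer, so that the product of activation derivatives is the only factor shown. The function names (`sigmoid_grad`, `relu_grad`, `backprop_factor`) are illustrative and do not come from the paper.

```python
import math


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; it never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def backprop_factor(act_grad, pre_activations):
    """Product of activation derivatives along one path through the layers.

    Assumption: every weight is treated as 1, so the result isolates the
    contribution of the activation derivative to the backpropagated gradient.
    """
    factor = 1.0
    for z in pre_activations:
        factor *= act_grad(z)
    return factor


if __name__ == "__main__":
    for depth in (1, 3, 5, 10):
        zs = [0.5] * depth  # hypothetical pre-activation in every hidden layer
        print(depth,
              backprop_factor(sigmoid_grad, zs),
              backprop_factor(relu_grad, zs))
```

With a saturating activation such as the sigmoid, the factor shrinks geometrically with depth, while for ReLU with positive pre-activations it stays at 1. This is consistent with the abstract's observation that stochastic GD degrades beyond three hidden layers, although the experiments reported there use real networks and trained weights.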

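Source journal

Journal of Information Science and Engineering (Engineering & Technology - Computer Science: Information Systems)

CiteScore: 2.00

Self-citation rate: 0.00%

Articles published: (no value given)

Review time: 8 months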

Journal description: The Journal of Information Science and Engineering is dedicated to the dissemination of information on computer science, computer engineering, and computer systems. The journal encourages articles on original research in the areas of computer hardware, software, man-machine interface, theory, and applications; tutorial papers in the above-mentioned areas; and state-of-the-art papers on various aspects of computer systems and applications.