Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu

Add an identifier
Abstract:Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are sometimes "over adapted". We design a new algor ...
1 result • Page 1 of 1
Traffic: 1 users visited in the last hour