Closing the Generalization Gap of Adaptive Gradient Methods in Training
Deep Neural Networks
Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu
Abstract: Adaptive gradient methods, which use historical gradient information to
automatically adjust the learning rate, have been observed to generalize worse
than stochastic gradient descent (SGD) with momentum in training deep neural
networks, despite their fast convergence. How to close this generalization gap
of adaptive gradient methods remains an open problem. In this work, we show that
adaptive gradient methods such as Adam and Amsgrad are sometimes "over adapted".
We design a new algorithm ...
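For context, below is a minimal sketch (not the paper's proposed algorithm, which the abstract truncates) of a standard Adam-style step next to SGD with momentum, illustrating how first- and second-moment estimates of past gradients produce the per-coordinate adaptive learning rate that the abstract refers to; hyperparameter defaults are the usual ones and are assumptions here.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates of past gradients rescale each coordinate's step."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (history of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step size
    return param, m, v

def sgd_momentum_step(param, grad, buf, lr=0.1, momentum=0.9):
    """SGD with momentum: one global learning rate, no per-coordinate adaptation."""
    buf = momentum * buf + grad
    return param - lr * buf, buf
```

The contrast between the two updates is where the "over adapted" behavior discussed in the abstract can arise: Adam's denominator sqrt(v_hat) shrinks the effective step differently for every coordinate, whereas SGD with momentum applies a single learning rate throughout.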