Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu

Abstract: Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are sometimes "over adapted". We design a new algor ...

Has Padam optimizer found utility in practice for optimizing neural network classifiers?


3.0 years ago

FJ Tunes ▴ 22

The paper claims that "Experiments on standard benchmarks show that our proposed algorithm can maintain a [sic] fast convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks." However, I have yet to encounter Padam being used in practice.

Does the authors' claim hold across a wide range of datasets, or should I keep using older optimization methods?
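
For anyone unfamiliar with the method, here is a minimal NumPy sketch of the partially adaptive update as I understand it from the paper: the Amsgrad-style second-moment estimate is raised to a power p between 0 and 1/2 instead of taking the square root. The function names `padam_step` and `init_state`, the default hyperparameters, and the small `eps` term are my own assumptions for illustration, not the authors' reference code.

```python
import numpy as np

def init_state(theta):
    """Zero-initialised moment buffers (hypothetical helper)."""
    z = np.zeros_like(theta)
    return {"m": z.copy(), "v": z.copy(), "v_hat": z.copy()}

def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    """One partially adaptive step (sketch; defaults are assumptions).

    p = 0.5 behaves like Amsgrad, p = 0 reduces to a momentum-style
    step on the averaged gradient with no per-coordinate scaling.
    """
    # Exponential moving averages of the gradient and its square.
    m = beta1 * state["m"] + (1 - beta1) * grad
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Amsgrad-style running maximum of the second-moment estimate.
    v_hat = np.maximum(state["v_hat"], v)
    # Partial adaptivity: divide by v_hat**p rather than sqrt(v_hat).
    # eps is added here only for numerical safety (assumption).
    theta = theta - lr * m / (v_hat + eps) ** p
    return theta, {"m": m, "v": v, "v_hat": v_hat}

# Toy usage: a few steps on f(x) = 0.5 * ||x||^2, whose gradient is x.
theta = np.array([1.0, -2.0])
state = init_state(theta)
for _ in range(5):
    theta, state = padam_step(theta, theta, state)
```

Varying `p` between 0 and 0.5 in this sketch is a quick way to see how much of the per-coordinate scaling is retained.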

training dnn optimizers ml • 729 views


last updated by Dustin 125 • posted by FJ Tunes

We appreciate the help!
