Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu

Abstract: Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are sometimes "over adapted". We design a new algor ...
Has Padam optimizer found utility in practice for optimizing neural network classifiers?
2.9 years ago
FJ Tunes ▴ 22

The paper claims that "Experiments on standard benchmarks show that our proposed algorithm can maintain a [sic] fast convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks." However, I have yet to see it used in practice.

Does the authors' claim hold across a wide range of datasets, or should I keep using older optimization methods?
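
For reference, here is a minimal pure-Python sketch of a partially adaptive update in the spirit of Padam, so the comparison is easy to try on a toy problem. It assumes the Amsgrad-style running maximum of the second moment, raised to a partial exponent `p` in place of the usual square root. The helper names (`init_state`, `padam_step`, `minimize_quadratic`), the default hyperparameters, and the `eps` term are illustrative assumptions, not the authors' reference implementation.

```python
def init_state(n):
    """Zero-initialised moment buffers for n parameters (hypothetical helper)."""
    return {"m": [0.0] * n, "v": [0.0] * n, "v_hat": [0.0] * n}


def padam_step(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    """Apply one partially adaptive step to a list of floats.

    p = 0.5 gives an Amsgrad-like step; p = 0 reduces to a momentum step
    with no per-coordinate scaling. eps is an assumption added here for
    numerical safety.
    """
    m, v, v_hat = state["m"], state["v"], state["v_hat"]
    new_params = []
    for i, (x, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g        # first moment
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g    # second moment
        v_hat[i] = max(v_hat[i], v[i])                 # Amsgrad-style running max
        new_params.append(x - lr * m[i] / (v_hat[i] ** p + eps))
    return new_params


def minimize_quadratic(steps=500, p=0.125, lr=0.1):
    """Toy check: minimise f(x) = sum(x_i^2), whose gradient is 2 * x."""
    x = [3.0, -2.0]
    state = init_state(len(x))
    for _ in range(steps):
        grads = [2.0 * xi for xi in x]
        x = padam_step(x, grads, state, lr=lr, p=p)
    return x


if __name__ == "__main__":
    print(minimize_quadratic())
```

Sweeping `p` between 0 and 0.5 on a real model is one way to check whether the claimed trade-off between convergence speed and generalization shows up on a given dataset.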

training dnn optimizers ml • 691 views
