Abstract: Adaptive gradient methods, which adopt historical gradient information to
automatically adjust the learning rate, despite the nice property of fast
convergence, have been observed to generalize worse than stochastic gradient
descent (SGD) with momentum in training deep neural networks. This leaves how
to close the generalization gap of adaptive gradient methods an open problem.
In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are
sometimes "over adapted". We design a new algor ...
Has the Padam optimizer found utility in practice for optimizing neural network classifiers?
The paper claims that "Experiments on standard benchmarks show that our proposed algorithm can maintain a [sic] fast convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks." However, I have yet to see it used in practice.
Does the authors' claim hold across a wide range of datasets, or should I keep using older optimization methods?
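
For context, here is a minimal sketch of the partially adaptive update as I understand it from the paper. The abstract quoted above is cut off before it describes the algorithm, so treat the details as my reading rather than the authors' reference implementation. The idea is that the Amsgrad second-moment term is raised to a power `p` between 0 and 1/2 instead of always taking a square root. The names `padam_step` and `init_state`, the default hyperparameters, and the placement of `eps` are assumptions made for this illustration.

```python
def init_state(n):
    """Zero-initialised (m, v, v_hat) buffers for n parameters (hypothetical helper)."""
    return ([0.0] * n, [0.0] * n, [0.0] * n)


def padam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """Apply one partially adaptive step to a list of floats.

    state is a tuple (m, v, v_hat) of lists with the same length as theta.
    p = 0.5 gives an Amsgrad-style step; p = 0 gives a momentum-only step.
    """
    m, v, v_hat = state
    new_theta, new_m, new_v, new_v_hat = [], [], [], []
    for th, g, m_i, v_i, vh_i in zip(theta, grad, m, v, v_hat):
        m_i = beta1 * m_i + (1.0 - beta1) * g      # moving average of gradients
        v_i = beta2 * v_i + (1.0 - beta2) * g * g  # moving average of squared gradients
        vh_i = max(vh_i, v_i)                      # Amsgrad-style running maximum
        # Partial adaptation: exponent p instead of a fixed square root.
        # Adding eps inside the power is an assumption made for numerical safety.
        th = th - lr * m_i / (vh_i + eps) ** p
        new_theta.append(th)
        new_m.append(m_i)
        new_v.append(v_i)
        new_v_hat.append(vh_i)
    return new_theta, (new_m, new_v, new_v_hat)


# Toy usage: minimise f(x) = x^2, whose gradient is 2x.
theta = [3.0]
state = init_state(1)
for _ in range(200):
    grad = [2.0 * x for x in theta]
    theta, state = padam_step(theta, grad, state, lr=0.05)
print(theta)
```

In this sketch `p` is the only knob that separates the method from its two endpoints, which is why I am asking whether tuning it pays off on real datasets or whether people simply stay with Adam or SGD with momentum.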