New SGD-SaI Algorithm Outperforms AdamW, Cuts Memory Usage in Half for Training Deep Neural Networks

No More Adam: Learning Rate Scaling at Initialization is All You Need

Abstract: In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalanc...
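The idea in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation, and several details are assumptions made for illustration: g-SNR is taken here to be the L2 norm of a group's first-step gradient divided by the standard deviation of its entries, the per-group rates are normalized by the largest g-SNR, and the function names `g_snr`, `scale_lrs_at_init`, and `sgdm_step` are hypothetical. The abstract itself states only that learning rates are scaled once at initialization per parameter group, guided by g-SNR, on top of SGD with momentum.

```python
import math
import statistics


def g_snr(grad, eps=1e-8):
    """ASSUMED definition: L2 norm of the gradient over the std of its entries."""
    norm = math.sqrt(sum(g * g for g in grad))
    std = statistics.pstdev(grad)
    return norm / (std + eps)


def scale_lrs_at_init(init_grads, base_lr):
    """Compute one fixed learning rate per parameter group from first-step gradients.

    ASSUMPTION: rates are normalized so the group with the largest g-SNR
    receives base_lr; the abstract does not specify a normalization.
    """
    snr = {name: g_snr(grad) for name, grad in init_grads.items()}
    max_snr = max(snr.values())
    return {name: base_lr * value / max_snr for name, value in snr.items()}


def sgdm_step(params, grads, bufs, lrs, momentum=0.9):
    """One SGD-with-momentum update using the fixed per-group learning rates.

    The only optimizer state is a single momentum buffer per parameter;
    no second-moment estimate is stored.
    """
    for name in params:
        bufs[name] = [momentum * b + g for b, g in zip(bufs[name], grads[name])]
        params[name] = [p - lrs[name] * b for p, b in zip(params[name], bufs[name])]


# Toy usage: two parameter groups with made-up first-step gradients.
init_grads = {"a": [1.0, 1.0, 1.0, 3.0], "b": [1.0, -1.0, 1.0, -1.0]}
lrs = scale_lrs_at_init(init_grads, base_lr=0.1)  # computed once, then frozen

params = {name: [0.0] * 4 for name in init_grads}
bufs = {name: [0.0] * 4 for name in init_grads}
sgdm_step(params, init_grads, bufs, lrs)
```

In this sketch the learning rates are computed a single time and never updated, so each later step costs the same as plain SGDM and the only optimizer state kept per parameter is the momentum buffer.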