Beyond Gradient Averaging in Parallel Optimization: Improved Robustness through Gradient Agreement Filtering
Abstract: We introduce Gradient Agreement Filtering (GAF) to improve on gradient averaging in distributed deep learning optimization. Traditional distributed data-parallel stochastic gradient descent involves averaging gradients of microbatches to calculate a macrobatch gradient that is then used to update model parameters. We find that gradients across microbatches are often orthogonal or negatively correlated, especially in late stages of training, which leads to me...
Read more at arxiv.org
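As a concrete illustration of the setup the abstract describes, below is a minimal sketch contrasting plain microbatch gradient averaging with an agreement-based filter. The excerpt does not spell out GAF's actual criterion, so this sketch assumes cosine similarity against a running aggregate with a threshold tau; the function names, the threshold, and the simulated gradients are illustrative assumptions, not the paper's implementation.

```python
import torch


def average_gradients(micro_grads):
    """Plain data-parallel step: the macrobatch gradient is the mean of microbatch gradients."""
    return torch.stack(micro_grads).mean(dim=0)


def agreement_filtered_gradient(micro_grads, tau=0.0):
    """Accumulate a microbatch gradient only if its cosine similarity with the
    running aggregate exceeds `tau` (an assumed agreement criterion for
    illustration, not necessarily the rule GAF uses)."""
    agg = micro_grads[0].clone()
    kept = 1
    for g in micro_grads[1:]:
        cos = torch.nn.functional.cosine_similarity(agg, g, dim=0)
        if cos > tau:  # keep only gradients pointing in a consistent direction
            agg = agg + g
            kept += 1
    return agg / kept


if __name__ == "__main__":
    torch.manual_seed(0)
    d = 1000
    true_dir = torch.randn(d)
    # Simulated microbatch gradients: three roughly aligned with a common
    # direction, one negatively correlated with it (the regime the abstract
    # describes for late-stage training).
    grads = [true_dir + 0.1 * torch.randn(d) for _ in range(3)]
    grads.append(-true_dir + torch.randn(d))

    cos = torch.nn.functional.cosine_similarity
    print("plain average vs. common direction:   ",
          cos(average_gradients(grads), true_dir, dim=0).item())
    print("filtered average vs. common direction:",
          cos(agreement_filtered_gradient(grads), true_dir, dim=0).item())
```

In this toy example the disagreeing gradient drags the plain average away from the shared direction, while the filtered aggregate excludes it; the full paper should be consulted for GAF's actual filtering rule and results.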