Gradient Descent Demystified: How Popular Optimization Algorithms Work in Machine Learning and Neural Networks

An overview of gradient descent optimization algorithms

This post explores how many of the most popular gradient-based optimization algorithms actually work.

Note: If you are looking for a review paper, this blog post is also available as an article on arXiv.

Update 20.03.2020: Added a note on recent optimizers.
Update 09.02.2018: Added AMSGrad.
Update 24.11.2017: Most of the content in this article is now also available as slides.
Update 15.06.2017: Added derivations of AdaMax and Nadam.
Update 21.06.2016: This post was posted to Hacker News. The disc...