An overview of gradient descent optimization algorithms
This post explains how several of the most popular gradient-based optimization algorithms actually work.
Note: If you are looking for a review paper, this blog post is also available as an article on arXiv.
Update 20.03.2020: Added a note on recent optimizers.
Update 09.02.2018: Added AMSGrad.
Update 24.11.2017: Most of the content in this article is now also available as slides.
Update 15.06.2017: Added derivations of AdaMax and Nadam.
Update 21.06.2016: This post was submitted to Hacker News. The disc...
Read more at ruder.io