IMHO, what the Adam optimizer is really doing is a very clever version of natural gradient descent: it keeps bias-corrected online estimates of the diagonal of the (empirical) Fisher information matrix and uses them to rescale the gradient steps.
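To make that concrete, here's a minimal sketch of one Adam step in NumPy (the function name and standalone-helper shape are my own, not from any particular library). The second-moment EMA `v` is the diagonal-Fisher-style estimate; dividing by its square root is the natural-gradient-like preconditioning:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (hypothetical standalone helper).

    m: EMA of the gradient (first moment).
    v: EMA of the squared gradient (second moment); this is the
       online estimate of the diagonal of the empirical Fisher.
    t: 1-based step count, needed for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction: both EMAs start at zero, so early estimates
    # are biased toward zero; these factors undo that.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Precondition the step by the inverse square root of the
    # estimated diagonal Fisher (plus eps for numerical stability).
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```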
Very cool, and what amazing engineering to be used so broadly :)