Hello everyone! In this blog post I introduce an optimizer called Cubic Step where I fit a cubic polynomial to the current and previous step’s losses and gradients to solve for the next update. My goal with this post is not to introduce some revolutionary optimizer but rather make aware the lack of the loss function’s utility in modern gradient based optimization.
Zeroth order calculations add just as much information about the loss manifold as any other sample, so why is it ignored? This is because most differntiable optimization techniques use gradient based climbing schemes that only care about the local curvature.
Other common techniques like momentum, adagrad, adam, etc. all work on either trying to approximate higher order terms or generate an adaptive learning rate ( in the above) to perform smarter updates. Considering this, loss is ignored because it has no bearing on the locality of the parameters in the loss manifold. This though is only true with a single draw; accross multiple updates these observed losses can even help such an update.
Here are some images to examplify the importance of zeroth order info (excuse my terrible drawing):
different curves with the same gradients at two samples
Excusing my terrible drawings in the figure, the idea is that even with maintaining first order information, the actual losses change the desired minima. I want to provide a motivating use-case. I devise the Cubic Step optimizer in this post to help deal with oscillations over a critical point in parameter space. The idea is simple, rather than taking a gradient step, I find a cubic that fits the four points and where is the current and previous parameter while is the loss function. I use this though only in the setting of an oscillating gradient, otherwise I take a normal ascent/descent step (for my experiments I use SGD but it could be any optimizer step). Denoting the fitted polynomial as , solving for the coefficients leads us to 4 equations and 4 unknowns. The matrix form is as so:
Its pretty easy to see the inverse exists as long as . In my initial implementation I got lazy and used tensorflow’s
tf.linalg.inv but that lead to a stream of issues (It worked fine on CPU but on GPU It always caused the program to hang even though I checked that the eigenvalues were numerically stable with respect to floating point precision) so I did what we all had to freshman year of college I did it by hand. The tedious excersize took around 5ish minutes. You get the inverse $Z$ as so…
Now that we did our little excersize we can actually run the optimizer. Unfortunately this entire scheme makes the difficult assumption that and are the same functions where is the iteration or batch number. This can be gauranteed if the batch size covers the whole dataset but as the batch gets smaller the variance increases. Lets test the effects of this empirically. I use a 4096 image subset of the MNIST dataset rather than the whole 60K so I can have batches run up to 100% of the data. For the experimental setup I reinitialize the weights for each batch size but keep initializations the same accross optimizer types. Note that I use batchsizes ranging form 128 to 4096 or 3-100% of the dataset. Also Cubic Step is only utilized in the regions where the weights oscillate, so it should be heavily impacted by the batch gradient’s variance– to find safer regions I use 25 epochs of warmup of only SGD.
results on a 5 layer CNN on MNIST 4096
These results were interesting and better than I expected. It seems the two methods perfomed nearly the same and for the most part batch percentage had very little effect on the efficacy of Cubic Step and it was also slightly smoother in some cases. The latter was the intended goal, but the former doesnt make much sense given how much this method should be dependant on intra-batch variance. I will atest that probably due to the smoothness of the shallow net’s manifold but will require further testing to be sure.
In conclusion the methods worked similarly but I think this result is enough to atleast give all who are reading this enough motivation to atleast keep the idea that zeroth order information is essentially free to access and could be applied in possible future work. Cubic Step only used 4 points, but in reality we can access all previous temporal samples if needed. I chose 4, because cubic was convenient to solve for critical points analytically, while post-5th order is provably impossible as a generality. Actually in a previous post, I go over Focal Gradient Loss as an intuitive competitor of Foal Loss, and that could be seen as an adaptive weighting dependant on zeroth order information too.
I hope you guys enjoyed the post and here is the repo for reproduction and playing yourself with Cubic Step