Jekyll feed, 2020-04-05T21:49:05+00:00, https://mshlis.github.io/feed.xml
Homepage: great blogs by a great dude (Michael Shliselberg)
[Idea] Cubic Step: We Have More Info, Lets Use It! (2019-11-25T00:00:00+00:00, https://mshlis.github.io/CubicStep)
<p>Hello everyone! In this blog post I introduce an optimizer called Cubic Step, where I fit a cubic polynomial to the current and previous steps’ losses and gradients to solve for the next update. My goal with this post is not to introduce some revolutionary optimizer but rather to draw attention to how little use modern gradient-based optimization makes of the loss value itself.</p>
<p>Zeroth order calculations add just as much information about the loss manifold as any other sample, <strong>so why are they ignored?</strong> This is because most differentiable optimization techniques use gradient-based climbing schemes that only care about the local slope.</p>
<script type="math/tex; mode=display">\begin{aligned}
x_{t+1} \leftarrow x_t \pm \eta \frac{\partial f}{\partial x_t} \\
\end{aligned}</script>
<p>Other common techniques like momentum, Adagrad, Adam, etc. all either try to approximate higher order terms or generate an adaptive learning rate (<script type="math/tex">\eta</script> in the above) to perform smarter updates. Considering this, the loss is ignored because, on its own, it has no bearing on where the parameters should move locally in the loss manifold. This is only true for a single draw though; across multiple updates the observed losses can help inform such an update.</p>
<p>Here are some curves to exemplify the importance of zeroth order info (excuse my terrible drawing):</p>
<p align="center">
<img src="/images/CubicStep/example_curves.png" />
<br /><b>different curves with the same gradients at two samples</b>
</p>
<p>The idea is that even when the first order information is the same, the actual losses change where the desired minimum lies. I want to provide a motivating use-case, so in this post I devise the Cubic Step optimizer to help deal with oscillations over a critical point in parameter space. The idea is simple: rather than taking a gradient step, I find a cubic that fits the four values <script type="math/tex">l(w_0), l(w_1), l'(w_0)</script> and <script type="math/tex">l'(w_1)</script>, where <script type="math/tex">w_0,w_1</script> are the current and previous parameters and <script type="math/tex">l(w)</script> is the loss function. I only use this when the gradient is oscillating; otherwise I take a normal ascent/descent step (for my experiments I use SGD, but it could be any optimizer step). Denoting the fitted polynomial as <script type="math/tex">p(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3</script>, solving for the coefficients gives us 4 equations in 4 unknowns. In matrix form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 1 & w_0 & w_0^2 & w_0^3 \\
1 & w_1 & w_1^2 & w_1^3 \\
0 & 1 & 2w_0 & 3w_0^2 \\
0 & 1 & 2w_1 & 3w_1^2 \end{pmatrix}
\begin{pmatrix} \alpha_0 \\
\alpha_1 \\
\alpha_2 \\
\alpha_3 \end{pmatrix}=
\begin{pmatrix} l(w_0) \\
l(w_1) \\
l'(w_0) \\
l'(w_1) \end{pmatrix} %]]></script>
<p>or equivalently:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \alpha_0 \\
\alpha_1 \\
\alpha_2 \\
\alpha_3 \end{pmatrix}=
\begin{pmatrix} 1 & w_0 & w_0^2 & w_0^3 \\
1 & w_1 & w_1^2 & w_1^3 \\
0 & 1 & 2w_0 & 3w_0^2 \\
0 & 1 & 2w_1 & 3w_1^2 \end{pmatrix}^{-1}
\begin{pmatrix} l(w_0) \\
l(w_1) \\
l'(w_0) \\
l'(w_1) \end{pmatrix} %]]></script>
<p>It’s pretty easy to see the inverse exists as long as <script type="math/tex">w_0 \neq w_1</script>. In my initial implementation I got lazy and used TensorFlow’s <code class="language-plaintext highlighter-rouge">tf.linalg.inv</code>, but that led to a stream of issues (it worked fine on CPU, but on GPU it always caused the program to hang, even though I checked that the eigenvalues were numerically stable with respect to floating point precision), so I did what we all had to do freshman year of college <script type="math/tex">\rightarrow</script> I did it by hand. The tedious exercise took around 5ish minutes. You get the inverse <script type="math/tex">Z</script> as so…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Z[:,:] &\leftarrow I_{4 \times 4} \\
Z[3,:] &\leftarrow Z[3,:] - \frac{2}{(w_0-w_1)}(Z[0,:]-Z[1,:]) + Z[2,:] \\
Z[3,:] &\leftarrow \frac{Z[3,:]}{(w_1-w_0)^2} \\
Z[2,:] &\leftarrow Z[2,:] - \frac{1}{(w_0-w_1)}(Z[0,:]-Z[1,:]) - (2w_0^2 - w_0w_1 - w_1^2)Z[3,:] \\
Z[2,:] &\leftarrow \frac{Z[2,:]}{(w_0-w_1)} \\
Z[1,:] &\leftarrow Z[1,:] - Z[0,:] - (w_1^2-w_0^2)Z[2,:] - (w_1^3-w_0^3)Z[3,:] \\
Z[1,:] &\leftarrow \frac{Z[1,:]}{(w_1-w_0)} \\
Z[0,:] &\leftarrow Z[0,:] - w_0Z[1,:] - w_0^2Z[2,:] - w_0^3Z[3,:] \\
\end{aligned} %]]></script>
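<p>To make the row operations above concrete, here is a minimal pure-Python sketch of them, applied to an identity matrix and then used to recover the coefficients of a known cubic. The function names <code class="language-plaintext highlighter-rouge">cubic_fit_inverse</code> and <code class="language-plaintext highlighter-rouge">fit_cubic</code> are hypothetical, and the scalar, list-based setup is an assumption for illustration; it is not the TensorFlow code from the repo.</p>

```python
def cubic_fit_inverse(w0, w1):
    """Inverse Z of the 4x4 system matrix, built with the row operations above."""
    if w0 == w1:
        raise ValueError("w0 and w1 must differ for the inverse to exist")
    d = w0 - w1
    # Start from the 4x4 identity.
    Z = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

    def combo(*terms):
        # Linear combination of rows: sum of coefficient * row.
        return [sum(c * row[j] for c, row in terms) for j in range(4)]

    # Row 3, then rows 2, 1 and 0, in the same order as the equations.
    Z[3] = combo((1, Z[3]), (-2 / d, Z[0]), (2 / d, Z[1]), (1, Z[2]))
    Z[3] = [v / (w1 - w0) ** 2 for v in Z[3]]
    Z[2] = combo((1, Z[2]), (-1 / d, Z[0]), (1 / d, Z[1]),
                 (-(2 * w0**2 - w0 * w1 - w1**2), Z[3]))
    Z[2] = [v / d for v in Z[2]]
    Z[1] = combo((1, Z[1]), (-1, Z[0]),
                 (-(w1**2 - w0**2), Z[2]), (-(w1**3 - w0**3), Z[3]))
    Z[1] = [v / (w1 - w0) for v in Z[1]]
    Z[0] = combo((1, Z[0]), (-w0, Z[1]), (-w0**2, Z[2]), (-w0**3, Z[3]))
    return Z


def fit_cubic(w0, w1, l0, l1, g0, g1):
    """Coefficients [a0, a1, a2, a3] from losses l0, l1 and gradients g0, g1."""
    Z = cubic_fit_inverse(w0, w1)
    rhs = [l0, l1, g0, g1]
    return [sum(Z[i][j] * rhs[j] for j in range(4)) for i in range(4)]


# Sanity check on l(w) = 1 + 2w - 3w^2 + 0.5w^3, sampled at w0 = 0.5 and w1 = 2.
coeffs = fit_cubic(0.5, 2.0, 1.3125, -3.0, -0.625, -4.0)
```

<p>Because the samples come from an exact cubic, <code class="language-plaintext highlighter-rouge">coeffs</code> should match the true coefficients up to floating point error.</p>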
<p>Now that we did our little exercise we can actually run the optimizer. Unfortunately this entire scheme makes the strong assumption that <script type="math/tex">l_k(w)</script> and <script type="math/tex">l_{k+1}(w)</script> are the same function, where <script type="math/tex">k</script> is the iteration or batch number. This can be guaranteed if the batch covers the whole dataset, but as the batch gets smaller the variance increases. Let’s test the effects of this empirically. I use a 4096-image subset of the MNIST dataset rather than the whole 60K so that batches can run up to 100% of the data. For the experimental setup I reinitialize the weights for each batch size but keep initializations the same across optimizer types. Note that I use batch sizes ranging from 128 to 4096, or 3-100% of the dataset. Also, Cubic Step is only utilized in the regions where the weights oscillate, so it should be heavily impacted by the batch gradient’s variance; to find safer regions I use 25 epochs of SGD-only warmup.</p>
<p align="center">
<img src="/images/CubicStep/mnist_results.png" />
<br /><b>results on a 5 layer CNN on MNIST 4096</b>
</p>
<p>These results were interesting and better than I expected. The two methods performed nearly the same; for the most part batch percentage had very little effect on the efficacy of Cubic Step, and Cubic Step was also slightly smoother in some cases. The smoothness was the intended goal, but the insensitivity to batch percentage doesn’t make much sense given how much this method should depend on the variance between batches. I tentatively attribute that to the smoothness of the shallow net’s loss manifold, but it will require further testing to be sure.</p>
<p>In conclusion, the methods worked similarly, but I think this result is enough to at least motivate everyone reading this to keep in mind that zeroth order information is essentially free to access and could be applied in future work. Cubic Step only used 4 values (two losses and two gradients), but in reality we can access all previous temporal samples if needed. I chose 4 because the critical points of a cubic are convenient to solve for analytically, while for polynomials beyond 5th order this is provably impossible in general. Actually, in a previous post I go over Focal Gradient Loss as an intuitive competitor of Focal Loss, and that could be seen as an adaptive weighting dependent on zeroth order information too.</p>
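<p>As a rough illustration of why the cubic is convenient: its critical points are the roots of the quadratic <script type="math/tex">p'(x) = \alpha_1 + 2\alpha_2 x + 3\alpha_3 x^2</script>, and the root with the positive square root is always the local minimum. The sketch below is an assumption-laden scalar version. The helper names are hypothetical, and the sign-flip test for an "oscillating" gradient is my simplification here, not necessarily the exact check used in the repo.</p>

```python
import math


def cubic_minimizer(a1, a2, a3):
    """Local minimum of p(x) = a0 + a1*x + a2*x^2 + a3*x^3, or None if there is none."""
    if a3 == 0:
        # Degenerate case: p is a quadratic (or a line).
        return -a1 / (2 * a2) if a2 > 0 else None
    disc = a2 * a2 - 3 * a1 * a3
    if disc <= 0:
        return None  # p'(x) has no two distinct real roots, so no local minimum
    # At this root p''(x) = 2*sqrt(disc) > 0, so it is the minimum.
    return (-a2 + math.sqrt(disc)) / (3 * a3)


def gradient_oscillates(g_prev, g_curr):
    """Assumed oscillation test: the gradient changed sign between two steps."""
    return g_prev * g_curr < 0


# p(x) = x^3 - 3x has its local minimum at x = 1.
x_min = cubic_minimizer(-3.0, 0.0, 1.0)
```

<p>When <code class="language-plaintext highlighter-rouge">cubic_minimizer</code> returns <code class="language-plaintext highlighter-rouge">None</code>, or the gradient is not oscillating, one would fall back to the ordinary optimizer step.</p>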
<p>I hope you guys enjoyed the post, and <a href="https://github.com/mshlis/CubicStep">here</a> is the repo for reproducing the results and playing with Cubic Step yourself.</p>
Michael Shliselberg. Current gradient based optimization ignores 0th order information. In this post I want to motivate others to try utilizing it.
[Idea] Super-Masks: Are They Good Initializations? (2019-11-02T00:00:00+00:00, https://mshlis.github.io/SuperMasks)
<p>A large advancement in understanding neural networks was the discovery of <em>lottery tickets</em>. Subgraphs of initializations that are correlated to better positioning in the loss manifold can be discovered through pruning of a trained network. This was investigated after a plethora of empirical pruning results showing that on multiple state-of-the-art models the parameters can be reduced by up to 90% with minimal reduction in accuracy. Investigating this further, some cool people at Uber discovered <strong>Super-Masks</strong>. Super-Masks are parameter masks that find a subgraph of the model’s initialization that performs relatively well. This means that with no actual change to the initialization weights themselves, there exists a suitable model within. How AMAZING is that???</p>
<script type="math/tex; mode=display">\begin{aligned}
\hat y = f(x; M \odot \theta_0) \\
\end{aligned}</script>
<p>So how do we solve for this mask <script type="math/tex">M</script>? It’s a discrete parameter, so normal backprop won’t work. The authors don’t mention it in the paper, but after I emailed them they told me they used <script type="math/tex">M = \sigma(\hat M) + StopGradient(Bern(\hat M) - \sigma(\hat M))</script> where you can actually learn <script type="math/tex">\hat M</script>. This is the expected value of the gradient of <script type="math/tex">Bern(\hat M)</script>. I find this not ideal due to the negative feedback derived from the variance, so I actually use thresholding, which can be thought of as a reweighted gradient consideration: <script type="math/tex">M = \sigma(\hat M) + StopGradient((\hat M > \tau) - \sigma(\hat M))</script> where I set <script type="math/tex">\tau</script> to be just .5. I found that this approach achieved slightly better results. Note you can also use the Intermediate Loss sampling I introduced a couple of blog posts back.</p>
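<p>To spell out what the thresholded formula does, here is a small framework-free sketch with the forward and backward passes written by hand for a single scalar <script type="math/tex">\hat M</script>. The function names are hypothetical, and the threshold is applied to <script type="math/tex">\hat M</script> exactly as the formula above is written; in a real framework the stop-gradient would be handled by autodiff rather than by a manual backward function.</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_forward(m_hat, tau=0.5):
    """Forward value of M = sigmoid(m_hat) + StopGradient((m_hat > tau) - sigmoid(m_hat))."""
    soft = sigmoid(m_hat)
    hard = 1.0 if m_hat > tau else 0.0
    # The soft terms cancel, so the forward value is the hard 0/1 mask.
    return soft + (hard - soft)


def mask_backward(m_hat, upstream_grad):
    """Gradient w.r.t. m_hat: the stop-gradient term contributes nothing,
    so only the derivative of sigmoid(m_hat) remains."""
    soft = sigmoid(m_hat)
    return upstream_grad * soft * (1.0 - soft)


on = mask_forward(2.0)       # above the threshold, weight is kept
off = mask_forward(-1.0)     # below the threshold, weight is masked out
grad = mask_backward(0.0, 1.0)
```

<p>The point of the construction is that the forward pass sees a discrete mask while the backward pass still gets a smooth, non-zero gradient to learn <script type="math/tex">\hat M</script> with.</p>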
<p>One question the supermask paper left me with is, <em>how good is it as an initialization?</em> Lottery tickets show quicker training with slightly stronger performance than the original. Does this still hold up? Let’s do some preliminary experimentation. I look into MNIST and Cifar10, specifically using Wide Resnet 28. I modify the implementation from keras-contrib to allow for learned masking of the convolution and dense layers. For experiments, I use a batch size of 128 and a learning rate of .01 for training both the mask-only and weight-only setups for 30 epochs. For all training sessions I also use random flips, translations and rotations for data augmentation. I then take the best masked version, freeze the mask, unfreeze the weights and train for up to 30 epochs as well. Note that when I fine-tune the supermasked model I use a reduced learning rate of 1e-5, because in practice I’ve noticed empirically that the supermask is extremely close to the local minimum, and any larger jumps almost entirely wiped out the boost of utilizing the supermask.</p>
<p align="center">
<img src="/images/SuperMask/mnist_training.png" />
<br /><b>MNIST Training</b>
</p>
<p align="center">
<img src="/images/SuperMask/cifar10_training.png" />
<br /><b>Cifar10 Training</b>
</p>
<p>As you can see in the above two figures, MNIST supermasks achieve similar results with equal training times as the normal regime. In this case, I do not see much improvement from using it as an initialization, but this is probably because they all converge to similar minima quickly. On Cifar10, on the other hand, we see equal training from both schemes for most of the way, but then using the masked variation as an initialization we see a large jump in accuracy. This is really interesting to see. In the initial paper they showed good results on Cifar10, but the masks did not do as well as the trained conv nets (except for the small model used for MNIST), whereas here, on a deep competitive model, we actually see that sheer masking can do wonders. Moreover, starting from the masked initialization leads to a decent boost (the green lines in the Cifar10 training figure). So let’s see what else we can learn from these experiments.</p>
<p align="center">
<img src="/images/SuperMask/mnist_percent_vs_nparams.png" />
<br /><b>MNIST mask utilization</b>
</p>
<p align="center">
<img src="/images/SuperMask/cifar10_percent_vs_nparams.png" />
<br /><b>Cifar10 mask utilization</b>
</p>
<p>In the above two figures, we see a very specific pattern: if layers had fewer weights, the mask percent dropped heavily. I explain this with the obvious consideration of ‘fewer weights / filters = information bottleneck = much harder to mess with’. I do want to note that no layer masks more than ~16%, so unlike the tiny lottery tickets discovered by pruning a trained model, we’re seeing dense masks on random initializations that perform amazingly both by themselves and as initializations. A good addition in the future would be to investigate this discrepancy between lottery tickets and these super-masks, because they do seem to be working on different properties of neural networks and their loss manifold.</p>
<p>The goal of this post was to describe and play with SuperMasks and answer whether <em>Super-Masks work as strong<strong>er</strong> initializations</em>. From the experiments I did, I show you can achieve a boost in performance using this strategy! Note I only did a small subset of experiments on small datasets, so seeing if this extends to larger datasets would be a fun continuation too.</p>
<p>I hope you guys enjoyed the post, and <a href="https://github.com/mshlis/SuperMasks">here</a> is the repo for reproducing the results and playing with the Super-Masks yourself.</p>
Michael Shliselberg. In this post I discuss Super Masks and their potential as initializations, similar to the lottery ticket hypothesis.
[Pers] Hack Umass Mentor: How to Debug (2019-10-21T00:00:00+00:00, https://mshlis.github.io/HackUmassMentor)
<p>This past weekend I had the pleasure of being a mentor at Hack Umass. Going to Umass as an ECE student, I saw the development of this event from a small hardware-only hackathon to what it is now. Now every year, a few friends and I drive up to the event to see the cool hacks and reconnect. This year, though, was my first as a mentor.</p>
<p align="center">
<img src="/images/HUM/nutshell.jpg" />
<br /><b>Hack Umass in a nutshell</b>
</p>
<h4 id="what-is-a-hackathon">What is a hackathon?</h4>
<p>Most people reading this probably already know what a hackathon is, but just in case I will reiterate. Hackathons are events where people come together and compete in a fixed time interval to “hack” together a cool project! Hackathons can be themed or have some form of goal statement. Examples can include Healthcare, AI, Blockchain, or any buzzword of choice… or they can just be open ended (Hack Umass fits in this category). It’s a great time: brainstorming, coding, food, stress, energy drinks and a lack of sleep. It’s a haven.</p>
<h4 id="hack-umass">Hack Umass</h4>
<p>I was a mentor. This just involves being available for groups that have questions or need any other form of help. The group of mentors included a diverse set of volunteers with skills spanning a variety of realms of hardware and software.<br />
Unfortunately there were over 850 participants and not nearly as many mentors. So sometimes you’re left with groups requiring assistance in a domain you don’t know much about. This is what I want to focus on.</p>
<p>Over 50% of the questions I ended up with regarded microcontrollers. During my degree in computer engineering I was no stranger to working with microcontrollers, but it’s been a while and it definitely was not my expertise. <strong>So how can I help a group of students if I don’t know the answer myself?</strong> Luckily, across many domains a lot of the same debugging procedures apply. These include the usual suspects:</p>
<ul>
<li>Is it software? You have a debugger?
<ul>
<li>No debugger means use print statements (or get a debugger if possible)</li>
<li>Sometimes print statements may just be easier (I know this is sacrilege to say, but it’s case by case in my opinion)</li>
<li>Check versioning</li>
<li>Test components separately if possible with drivers/stubs (this may be expensive given the time frame, but sometimes you can build these as you go)</li>
</ul>
</li>
<li>Google the error
<ul>
<li>Include errors that link to the external issues</li>
<li>Include version / meta info depending on the error</li>
<li>Use possible solutions / tutorials found</li>
</ul>
</li>
<li>When using a tutorial
<ul>
<li>Make sure anything you do is reversible! (just because it worked for someone else doesn’t mean it will for you, and you could even make it worse!)</li>
<li>If one tutorial does not solve it for you, undo the changes before finding another – sometimes solutions conflict</li>
</ul>
</li>
<li>Could it be the hardware?
<ul>
<li>No one wants to blame the hardware; it seems like a cowardly scapegoat, but sometimes it really is the problem!</li>
<li>Check components one by one if possible (drivers/stubs, voltmeter, ammeter, replace the part, etc.)</li>
</ul>
</li>
</ul>
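<p>As an example of the drivers/stubs item in the list above, here is a minimal sketch of isolating one piece of logic from the hardware it depends on. Everything in it is hypothetical (the <code class="language-plaintext highlighter-rouge">StubSensor</code> class, the fan logic, the 30 degree threshold); the point is only that a stub with canned readings plus a tiny driver lets you rule the software in or out before touching the wiring.</p>

```python
class StubSensor:
    """Stands in for the real temperature sensor and returns canned readings."""

    def __init__(self, readings):
        self._readings = list(readings)

    def read_celsius(self):
        return self._readings.pop(0)


def fan_should_run(sensor, threshold_c=30.0):
    """Component under test: decide whether the fan should be switched on."""
    return sensor.read_celsius() > threshold_c


def run_driver():
    """Driver: feed known inputs through the component and collect the decisions."""
    sensor = StubSensor([25.0, 35.0])
    return [fan_should_run(sensor), fan_should_run(sensor)]


decisions = run_driver()  # one cool reading, one hot reading
```

<p>If the decisions come out right with the stub but the real board still misbehaves, that points toward the sensor, the wiring or the driver code for the actual part.</p>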
<p>These are just a handful of paths, and a lot of the time the process is adaptive. Errors can be tricky and sometimes multi-faceted, so it is case by case, but a lot of the time this pseudo-road map will get you far.</p>
<p>Also, as a mentor, sometimes just bringing a new perspective helps. I’m sure everyone has done this: you begin a group project and everyone has their own ideas, but eventually you converge on a concept and a path, and once there you are all so focused on it that you don’t pay attention to some obvious mistakes/solutions. The same is true for any form of group-based problem solving.</p>
<p>Overall it was a great weekend. I got a chance to go to Pita Pocket (best restaurant in all of Amherst, with the greatest falafel on this side of the planet) and saw my friends while meeting a ton of awesome people. Seeing so many people working on such amazing projects is always inspiring and always gets you revved up! It was a wonderful experience and I hope to be there again next year.</p>
Michael Shliselberg. This past weekend I had the pleasure of being a mentor at Hack Umass. In this post I discuss how the experience was, along with common debugging ideas that are helpful in a variety of scenarios.
[Pers] Hack Umass Mentor: How to Debug (2019-10-21T00:00:00+00:00, https://mshlis.github.io/HackUmassMentor)
<p>This past weekend I had the pleasure of being a mentor at Hack Umass. Going to Umass as an ECE student, I saw the development of this event from a small hardware-only hackathon to what it is now. Now every year, a few friends and I drive up to the event to see the cool hacks and reconnect. This year, though, was my first as a mentor.</p>
<p align="center">
<img src="/images/HUM/nutshell.jpg" />
<br /><b>Hack Umass in a nutshell</b>
</p>
<h4 id="what-is-a-hackathon">What is a hackathon?</h4>
<p>Most people reading this probably already know what a hackathon is, but just in case I will reiterate. Hackathons are events where people come together and compete in a fixed time interval to “hack” together a cool project! Hackathons can be themed or have some form of goal statement. Examples can include Healthcare, AI, Blockchain, or any buzzword of choice… or they can just be open ended (Hack Umass fits in this category). It’s a great time: brainstorming, coding, food, stress, energy drinks and a lack of sleep. It’s a haven.</p>
<h4 id="hack-umass">Hack Umass</h4>
<p>I was a mentor. This just means being available for groups that have questions or need any other form of help. The group of mentors was a diverse set of volunteers with skills spanning a variety of hardware and software areas.<br />
Unfortunately there were over 850 participants and not nearly as many mentors, so sometimes you’re left with groups requiring assistance in a domain you don’t know much about. This is what I want to focus on.</p>
<p>Over 50% of the questions I ended up with were about microcontrollers. During my degree in computer engineering I was no stranger to working with microcontrollers, but it’s been a while and it was definitely not my expertise. <strong>So how can I help a group of students if I don’t know the answer myself?</strong> Luckily, across many domains a lot of the same debugging procedures apply. These include the usual suspects:</p>
<ul>
<li>Is it software? Do you have a debugger?
<ul>
<li>No debugger means use print statements (or get a debugger if possible)</li>
<li>Sometimes print statements may just be easier (I know this is sacrilege to say, but it’s case by case in my opinion)</li>
<li>Check versioning</li>
<li>Test components separately if possible with drivers/stubs (this may be expensive given the time frame, but sometimes you can build these as you go)</li>
</ul>
</li>
<li>Google the error
<ul>
<li>Include errors that link to the external issues</li>
<li>Include version / meta info depending on the error</li>
<li>Use possible solutions / tutorials found</li>
</ul>
</li>
<li>When using a tutorial
<ul>
<li>Make sure anything you do is reversible! (Just because it worked for someone else doesn’t mean it will for you, and you could make it worse!)</li>
<li>If one tutorial does not solve it for you, undo the changes before finding another – sometimes solutions conflict</li>
</ul>
</li>
<li>Could it be the hardware?
<ul>
<li>No one wants to blame the hardware, since it seems like a cowardly scapegoat, but sometimes it really is!</li>
<li>Check components one by one if possible (drivers/stubs, voltmeter, ammeter, replace the part, etc.)</li>
</ul>
</li>
</ul>
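<p>To make the drivers/stubs and print-statement ideas from the list above concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the sensor class, the fan logic and the threshold are made up for illustration, not taken from any project at the event): a stub stands in for the hardware, a tiny driver feeds it known readings, and a print statement shows what the logic actually received.</p>

```python
# Hypothetical example: none of these names come from a real project.


class StubTemperatureSensor:
    """Stands in for real hardware and replays known readings."""

    def __init__(self, readings):
        self._readings = list(readings)
        self._index = 0

    def read_celsius(self):
        value = self._readings[self._index % len(self._readings)]
        self._index += 1
        return value


def fan_should_run(sensor, threshold_celsius=30.0):
    """Logic under test: run the fan when the reading exceeds the threshold."""
    reading = sensor.read_celsius()
    # A print statement is often enough to see what the code actually received.
    print(f"[debug] reading={reading:.1f} threshold={threshold_celsius:.1f}")
    return reading > threshold_celsius


def run_driver():
    """Tiny driver: feed known inputs and collect the decisions."""
    sensor = StubTemperatureSensor([25.0, 31.5, 30.0])
    return [fan_should_run(sensor) for _ in range(3)]


results = run_driver()
```

<p>If the logic behaves with the stub but not with the real part, that points toward the hardware or its wiring rather than the code.</p>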
<p>These are just a handful of paths, and a lot of the time the process is adaptive. Errors can be tricky and sometimes multifaceted, so it is case by case, but a lot of the time this pseudo-road map will get you far.</p>
<p>Also, as a mentor, sometimes just adding a new perspective helps. I’m sure everyone has done this: you begin a group project and everyone has their own ideas, but eventually you converge on a concept and a path, and once there you are all so focused on it that you don’t notice some obvious mistakes or solutions. The same is true for any form of group-based problem solving.</p>
<p>Overall it was a great weekend. I got a chance to go to Pita Pocket (the best restaurant in all of Amherst, with the greatest falafel on this side of the planet) and saw my friends while meeting a ton of awesome people. Seeing so many people working on such amazing projects is always inspiring and always gets you revved up! It was a wonderful experience and I hope to be there again next year.</p>Michael ShliselbergThis past weekend I had the pleasure of being a mentor at Hack Umass. In this post I discuss how the experience was along with common debugging ideas that are helpful in a variety of scenarios[Idea] Focal Gradient Loss: Are we looking at focal loss correctly?2019-10-20T00:00:00+00:002019-10-20T00:00:00+00:00https://mshlis.github.io/FocalGradLoss<p>RetinaNet is a near state-of-the-art object detector that, using a simple adaptive weighting scheme, helps bridge some of the gap between one- and two-stage object detectors by dealing with the inherent class imbalance from the large background set constructed by the anchoring process. Specifically, they use Focal Loss:</p>
<script type="math/tex; mode=display">FL(p) \propto - (1-p)^\gamma log(p)</script>
<p>The factor <script type="math/tex">(1-p)^\gamma</script> down-weights the categorical cross-entropy loss according to how high <script type="math/tex">p</script> is. This is parameterized by <script type="math/tex">\gamma</script>: the higher it is, the greater the down-weighting. In their paper they make it clear that this choice of adaptive weighting is arbitrary and could be replaced with other schemes.</p>
<p>The question I pose is: does this accomplish their goal? (I asked this a while back in a <a href="https://ai.stackexchange.com/questions/13755/does-retina-nets-focal-loss-accomplish-its-goal">stack exchange question</a>.) Given the lack of response to that post, I decided to investigate it myself. I propose another simple adaptive weighting scheme, but applied to the gradient directly. I do this because most optimizers use a form of gradient descent <script type="math/tex">\theta \leftarrow \theta - \nabla_{\theta}L</script>, so masking the gradient directly masks how much is learned from each element. I call this set of losses <em>focal gradient loss</em>:</p>
<script type="math/tex; mode=display">FGL(p) \propto - StopGradient((1-p)^\gamma) \ log(p)</script>
<h3 id="comparing-the-objective-functions">Comparing the objective functions</h3>
<p>It’s hard to compare the two directly because of the <script type="math/tex">StopGradient</script> function, so first let’s look at their gradients:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\dot{FL(p)} &= - \frac{(1-p)^\gamma }{p} + \gamma (1-p)^{\gamma - 1} \ log(p)\\
&= (1-p)^\gamma \ \dot{CE(p)} + \gamma (1-p)^{\gamma - 1} \ log(p)\\
&= (1-p)^\gamma \ \dot{CE(p)} + R(p) \\
\dot{FGL(p)} &= - \frac{(1-p)^\gamma }{p} \\
&= (1-p)^\gamma \ \dot{CE(p)}
\end{aligned} %]]></script>
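<p>As a quick sanity check of the two gradients above, here is a small pure-Python sketch. The function names are my own for illustration (they are not from keras-retinanet), and the stop-gradient is modelled by simply treating the weight as a constant. It compares the analytic focal loss derivative against a central difference and confirms that the gap between the two gradients is exactly the residual.</p>

```python
import math


def focal_loss(p, gamma):
    """FL(p) = -(1 - p)^gamma * log(p)."""
    return -((1.0 - p) ** gamma) * math.log(p)


def residual(p, gamma):
    """R(p) = gamma * (1 - p)^(gamma - 1) * log(p)."""
    return gamma * (1.0 - p) ** (gamma - 1) * math.log(p)


def focal_gradient_loss_grad(p, gamma):
    """dFGL/dp: the weight is held constant, so only log(p) is differentiated."""
    return -((1.0 - p) ** gamma) / p


def focal_loss_grad(p, gamma):
    """Analytic dFL/dp: the weighted cross-entropy gradient plus R(p)."""
    return focal_gradient_loss_grad(p, gamma) + residual(p, gamma)


def central_difference(f, p, eps=1e-6):
    """Numerical derivative used only to check the analytic formula."""
    return (f(p + eps) - f(p - eps)) / (2.0 * eps)


def check(p=0.3, gamma=2.0):
    numeric = central_difference(lambda q: focal_loss(q, gamma), p)
    analytic = focal_loss_grad(p, gamma)
    gap = analytic - focal_gradient_loss_grad(p, gamma)
    return numeric, analytic, gap


numeric, analytic, gap = check()
```

<p>Both gradients are negative on (0, 1), so the residual only ever makes the focal loss gradient larger in magnitude.</p>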
<p>Looking at these, we see that the focal loss gradient is the adaptively weighted cross-entropy gradient plus a residual, <script type="math/tex">R(p)</script>. This is the differentiating factor between the two approaches. We can rewrite the residual as <script type="math/tex">R(p) = -\gamma * FL_{\gamma -1}(p)</script>. It is difficult to say what this does from the optimization perspective, but in terms of sheer magnitude it always increases the gradient (because both the weighted term and the residual are never positive), more so when the focal loss with exponent <script type="math/tex">\gamma - 1</script> is high. This is difficult to describe intuitively, but it may be seen as an additive adaptive bias. If you do the math out, this actually increases the spread, which may be seen as a benefit, but it is difficult to say for certain, which is why I perform this experiment.</p>
<p align="center">
<img src="/images/FocalGradientLoss/focal_gradients.png" />
</p>
<p>Even though stop_gradient makes the two hard to compare in the loss space, we can integrate the gradient to see its effective equivalent. For values of <script type="math/tex">\gamma</script> that aren’t integers the integral becomes a lot more complicated, so for ease let’s just look at the integer case:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
L(p) &= \int_{z=1}^p -\frac{(1-z)^\gamma}{z} dz\\
&= -log(p) - \sum_{i=1}^\gamma {\gamma \choose i}\frac{(-p)^i - (-1)^i}{i} \
\end{aligned} %]]></script>
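<p>As a worked instance of the formula above, take <script type="math/tex">\gamma = 2</script>. The sum then has two terms, with binomial coefficients 2 and 1:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
L(p) &= -log(p) - \left[ 2 \, \frac{(-p) - (-1)}{1} + \frac{p^2 - 1}{2} \right] \\
&= -log(p) + 2p - \frac{p^2}{2} - \frac{3}{2}
\end{aligned} %]]></script>
<p>Differentiating gives <script type="math/tex">-\frac{1}{p} + 2 - p = -\frac{(1-p)^2}{p}</script>, which matches the focal gradient loss derivative for <script type="math/tex">\gamma = 2</script>, and <script type="math/tex">L(1) = 0</script> as expected.</p>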
<p>Note that the lower bound of the integral is arbitrary, because it only affects the constant, but since we have <script type="math/tex">FGL(1)=0</script> from the stop_gradient formulation, I use <script type="math/tex">z=1</script> to keep it consistent. Now, plotting these, we see the following:</p>
<p align="center">
<img src="/images/FocalGradientLoss/focal_objectives.png" />
</p>
<p>As you can see, focal gradient loss is always lower than the respective focal loss, which is understandable given the spread differential caused by the residual.</p>
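<p>The comparison can also be checked numerically. The sketch below is an illustration with my own function names, restricted to integer <script type="math/tex">\gamma</script>. It evaluates the closed-form effective objective, compares it against a simple midpoint-rule integration of the gradient, and compares it against focal loss at a few points.</p>

```python
import math


def focal_loss(p, gamma):
    """FL(p) = -(1 - p)^gamma * log(p)."""
    return -((1.0 - p) ** gamma) * math.log(p)


def effective_fgl(p, gamma):
    """Closed form of the integrated FGL gradient for a non-negative integer gamma."""
    total = -math.log(p)
    for i in range(1, gamma + 1):
        total -= math.comb(gamma, i) * ((-p) ** i - (-1) ** i) / i
    return total


def integrated_fgl(p, gamma, steps=20000):
    """Midpoint-rule estimate of the integral of -(1 - z)^gamma / z from 1 to p."""
    width = (p - 1.0) / steps  # negative for p below 1, since we integrate from 1 down to p
    total = 0.0
    for k in range(steps):
        z = 1.0 + (k + 0.5) * width
        total += -((1.0 - z) ** gamma) / z
    return total * width


# (gamma, p, closed form, numeric integral, focal loss) at a few sample points
comparison = [
    (gamma, p, effective_fgl(p, gamma), integrated_fgl(p, gamma), focal_loss(p, gamma))
    for gamma in (1, 2, 3)
    for p in (0.1, 0.5, 0.9)
]
```

<p>One way to see why the ordering holds: the two objectives agree at <script type="math/tex">p=1</script>, and the derivative of their difference is exactly <script type="math/tex">R(p)</script>, which is never positive on (0, 1].</p>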
<h3 id="experiment">Experiment</h3>
<p>For computational reasons, we test only the setting <script type="math/tex">\gamma = 2</script> for 15 epochs on PASCAL VOC 2012. We use the keras-retinanet open source package to do this. The only adjustments I make to the repo are adding the custom loss function and adapting the training to work with Horovod for more efficient parallelization (I will take any speedups I can get :D)</p>
<p align="center">
<img src="/images/FocalGradientLoss/fl_vs_fgl.png" />
</p>
<p>Looking at the results, we see they perform similarly. Focal Gradient Loss does have a slightly better mAP, but this could be due to noise in the training procedure. There is also more noise in general in Focal Gradient Loss’s training compared to Focal Loss’s; this may be due to the lack of that additive bias. From this single (and incomplete) experiment there isn’t enough to draw conclusions, but if I were forced to, I would say they perform similarly but Focal Gradient Loss seems less robust. <br />
I will hopefully be adding more experiments in the future. Stay tuned.</p>Michael ShliselbergIn this post I discuss a new loss function for single stage object detectors that like retinanets focal loss is adaptive, but works at the gradient level rather than at the objective function level[Idea] Intermediate Loss Sampling: A Differentiable Categorical Sampler2019-10-14T00:00:00+00:002019-10-14T00:00:00+00:00https://mshlis.github.io/ILSampling<p>I propose a novel sampling approach, leveraging an intermediate loss function to differentiate through a categorical draw. There exists a long history of using policy gradient techniques where only the policy network gradients are utilized, but in the last couple of years approaches like the Gumbel Softmax have surfaced. Gumbel Softmax attempts to model categorical variables through a reparametrization trick and uses softmax to approximate the argmax operator, which as a result is completely differentiable. The Gumbel Softmax is parametrized by a temperature hyperparameter, <script type="math/tex">T</script>. At <script type="math/tex">T=0</script>, this approximation is equivalent to a draw from the categorical distribution but the gradient is undefined. As <script type="math/tex">T</script> increases, the derivative is better defined, but the sample becomes more and more smooth. This give and take is the primary issue with this approach.</p>
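<p>To illustrate the temperature trade-off described above, here is a minimal pure-Python sketch of a Gumbel Softmax sample. It is an illustration only: the probabilities and helper names are made up, and it is not the code used in the experiments. The same Gumbel noise is reused across temperatures so that only the effect of <script type="math/tex">T</script> changes.</p>

```python
import math
import random


def gumbel_noise(size, rng):
    """Standard Gumbel samples via -log(-log(u)) for u uniform on (0, 1)."""
    noise = []
    for _ in range(size):
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        noise.append(-math.log(-math.log(u)))
    return noise


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(probs, temperature, noise):
    """Relaxed sample: softmax((log(probs) + noise) / temperature)."""
    logits = [(math.log(p) + g) / temperature for p, g in zip(probs, noise)]
    return softmax(logits)


rng = random.Random(0)
probs = [0.2, 0.5, 0.3]  # hypothetical categorical distribution
noise = gumbel_noise(len(probs), rng)
samples = {t: gumbel_softmax(probs, t, noise) for t in (0.1, 1.0, 10.0)}
```

<p>With the noise held fixed, lowering the temperature pushes the sample toward a one-hot vector, while raising it flattens the sample toward uniform.</p>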
<h3 id="construction">Construction</h3>
<p>In the forward pass, it is simply the draw itself with no approximations. For the backward pass, we define an intermediate loss function <script type="math/tex">KL(\pi_D || \pi)</script>, where <script type="math/tex">\pi_D</script> is a single SGD step from solving which distribution best approximates our end objective. Note that this takes advantage of the fact that a categorical draw in one-hot form is visually equivalent to a delta function. Because of that, all it requires is some function <script type="math/tex">h</script> that maps real vectors onto some probability simplex such that <script type="math/tex">h(z) = z</script>. The actual algorithm then follows the description below.</p>
<p align="center">
<img src="/images/ILS/ILS.png" width="650px" height="250px" />
<br /><b>algorithm</b>
</p>
<h3 id="toy-problems">Toy Problems</h3>
<h4 id="toy-example-1">Toy Example 1:</h4>
<p>I show the efficacy of the approach with a toy example where soft sampling would actually be advantageous. Intermediate Loss Sampling ends up performing just as well, if not better. Given a random discrete categorical distribution, I minimize <script type="math/tex">\mathbb{E}[KL(\pi_{true} || \pi_{model})]</script> while only having access to single draws at a time. Given that KL divergence is non-negative, I show this is a suitable test by showing it is an upper bound of our true objective, which creates a squeeze-based optimization problem.</p>
<p>The simple proof:</p>
<p align="center">
<img src="/images/ILS/ex1_0.png" height="250px" width="350px" />
</p>
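<p>The proof itself is in the image above. As a numeric sanity check of the same kind of bound, here is a small sketch under one assumption of mine: that the per-draw loss is the KL divergence from the one-hot draw to the model distribution. Under that assumption the expected per-draw loss equals the true KL plus the entropy of the true distribution, so it upper-bounds the true objective. The distributions below are made up.</p>

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, with the convention 0 * log(0) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def expected_single_draw_kl(p_true, p_model):
    """Exact expectation over draws k ~ p_true of KL(one_hot(k) || p_model)."""
    size = len(p_true)
    return sum(
        p_true[k] * kl_divergence(one_hot(k, size), p_model)
        for k in range(size)
    )


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


p_true = [0.7, 0.2, 0.1]   # hypothetical target distribution
p_model = [0.4, 0.4, 0.2]  # hypothetical model distribution
surrogate = expected_single_draw_kl(p_true, p_model)
target = kl_divergence(p_true, p_model)
```

<p>The gap between the two is the entropy of the true distribution, which does not depend on the model, so both quantities are minimized by the same model distribution.</p>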
<p>For my experiments I use temperatures of 0.1, 1 and 10 for the Gumbel Softmax and step sizes of 1e-1, 1e-2 and 1e-3 for Intermediate Loss Sampling.</p>
<p align="center">
<img src="/images/ILS/toyexp_1.png" />
</p>
<p>The above shows that in all settings the intermediate loss approach works just as well, if not better, and converges almost immediately. Note that the noise in the IL settings is due to the inherent variance of using single-step draws, which Gumbel Softmax won’t have, as it can produce smoother outputs than pure one-hot encodings.</p>
<p>I hope to add more test cases soon. Another positive of this method is that it can extend to non-categorical random variables, assuming you can find a loss function that is closed-form differentiable with respect to the parameters you are learning.</p>
<p>Click <a href="https://github.com/mshlis/ILSampling">here</a> to see the Repo</p>Michael ShliselbergI propose a novel sampling approach, leveraging an intermediate loss function to differentiate through a categorical draw. I compare this method to the commonly used Gumbel Softmax