I did a quick comparison on MNIST with a small ConvNet, pitting this AdamWScheduleFree optimizer against a few other optimizers (RAdam, NAdam, AdamW, SGD, Adam, Adafactor, SophiaG). The validation accuracy seems okay and the train loss decreases remarkably quickly.
Validation accuracy: https://i.imgur.com/8ZtX7Rd.png
Train loss: https://i.imgur.com/o5XdQ29.png
Code: https://bpa.st/NVJQ (currently only runs on my computer; I haven't had time to clean it up)
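For reference, the gist of how I'm calling it (going from the schedulefree package README as I remember it; the model and data loader names are placeholders and the exact arguments may differ):

    import torch
    import torch.nn.functional as F
    import schedulefree

    model = SmallConvNet()                 # placeholder for my MNIST ConvNet
    optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3)

    for epoch in range(10):
        model.train()
        optimizer.train()                  # the optimizer itself has train/eval modes
        for x, y in train_loader:          # placeholder DataLoader
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()

        model.eval()
        optimizer.eval()                   # switch to the averaged weights before evaluating
        # ... compute validation accuracy here ...

The optimizer.train()/optimizer.eval() switch matters because evaluation is supposed to run on the averaged iterate.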
Note that this is just a toy benchmark with very little hyperparameter tuning. You could probably get similar results with most optimizers and an appropriate schedule. Nevertheless, I appreciate every hyperparameter that I do not have to set manually.
In summary, this seems to be a promising optimizer. I'll add it to my list of optimizers to try for new deep learning projects.
> Can you share the list of your go to optimizers outside of the Adam family?
Sure! It depends a bit on what I'm doing.
If I want to optimize someone else's model, I start with Adam, because that's most likely what the hyperparameters have been tuned for. Once I've verified that Adam works, I'll try other optimizers.
If I have very few parameters and don't care about overfitting, I try LBFGS, which usually gets to a local optimum the fastest. Note that it will likely find a sharp local optimum; for better generalization performance you often prefer a wide optimum, so the model still works if there is a bit of drift in the data.
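Roughly what that looks like in PyTorch (the objective here is a stand-in; LBFGS wants the full-batch loss inside a closure because it may re-evaluate it several times per step):

    import torch

    params = torch.randn(10, requires_grad=True)   # a handful of parameters
    optimizer = torch.optim.LBFGS([params], lr=1.0, max_iter=20,
                                  line_search_fn="strong_wolfe")

    def closure():
        optimizer.zero_grad()
        loss = full_batch_loss(params)   # placeholder objective
        loss.backward()
        return loss

    for _ in range(50):
        optimizer.step(closure)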
If I do not want to mess around with learning rates, I use Adafactor, which is a bit slower, but usually works okay without any tuning.
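A sketch of the learning-rate-free mode, using the Adafactor implementation from HF transformers (I believe these are the relevant flags, but double-check the docs):

    from transformers.optimization import Adafactor

    # lr=None + relative_step=True lets Adafactor pick its own step size
    optimizer = Adafactor(
        model.parameters(),
        lr=None,
        scale_parameter=True,
        relative_step=True,
        warmup_init=True,
    )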
If I had very little memory available, I'd use SGD, but in my opinion it's not worth the hassle of tuning learning rate, momentum, dampening and weight decay. I'd rather use a smaller model if possible.
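For reference, those are exactly the knobs in the constructor (values below are just common starting points, not recommendations):

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,             # these four interact, which is the tuning hassle
        momentum=0.9,
        dampening=0.0,
        weight_decay=1e-4,
    )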
I usually do not train with extremely large batch sizes, but if I did, I'd try the optimizers which claim to work well for large batch sizes.
All in all, it probably does not matter too much which optimizer you are using, as long as you tuned it a little bit. Same goes for the model, loss functions, activation functions and all that other fluff.
What /is/ important is that you design your problem in such a way that it is as easy as possible to solve. For example, it is very difficult to read arbitrary hand-written text from an image. If you have control over where the data comes from, it would be better to write the text character by character into a printed grid with additional optical markers for image registration. Or even better, replace it with a multiple choice list. If there are not too many exceptional cases, an "other" option for manual review could be added. Often, automating 99 % of the work is more than good enough and it is better to keep a human in the loop to handle edge cases.
Secondly, control the data capture as strictly as possible. For example, use uniform lighting, place the object to recognize at exactly the same position, exclude disruptive elements, etc.
Lastly, data is king. If your training data does not match the test data, you can train all you want and still get garbage results. Either collect enough training data to cover all test cases or, if that is not possible from the start, retrain with new data regularly. Data augmentation might help to some degree, but it is impossible to predict everything.
This is a pretty hyped-up optimizer that seems to have okay-ish performance in practice, but there are a number of major red flags here. For one, the baselines are decently sandbagged, yet the Twitter posts sharing them (which are pretty hype-y) directly say that the baselines are "highly tuned" and that there's no benchmark trickery, which is flat-out wrong. If someone has no experience with said benchmarks, that is a plausible statement, but having worked with some of these datasets very closely, I can say some of the baselines are simply terrible; I don't know where they came from.
Additionally, the optimizer does actually appear to have a kind of momentum, despite claims directly stating the contrary, and uses it with a Nesterov-like step (line 2 of 3 in the inner loop). Finally, it is 'schedule-free' because the schedule is actually hardcoded into the algorithm itself: 1./steps_taken, which is hardly a rare learning rate schedule. It is a decently robust but sometimes suboptimal schedule, and I find it sketchy to claim that it is 'schedule-free'. This also cripples the optimizer by tying performance to the number of steps taken -- which is potentially a problem if you are using any batch-size + LR scaling strategies, as I understand it.
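For the curious, this is my reading of the three-line inner loop being discussed (simplified to the SGD case and written schematically, so treat it as a sketch of the posted code rather than the reference implementation; grad() is a placeholder for the stochastic gradient):

    # x: averaged iterate (what gets evaluated), z: base iterate
    # beta ~ 0.9, gamma: learning rate
    for t in range(1, num_steps + 1):
        y = (1 - beta) * z + beta * x    # interpolation point -- the momentum-like part
        z = z - gamma * grad(y)          # gradient step taken at y (Nesterov-flavored)
        c = 1.0 / t                      # the hardcoded 1/steps_taken weighting
        x = (1 - c) * x + c * z          # running average of the z iterates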
There is a mixture of hype and substance here, and I wish the author were more straightforward with their approach and claims. I think there is the potential for a good "bolts-included" optimizer in some of the ideas being presented here, but the amount of overhyping and deception makes me not want to trust any of the follow-up work.
Unfortunately, hype is what sells best on Twitter, and some of the claims being made here appear to be at the very best deceptive, and at the very worst, untrue. I could be wrong -- these are just my personal opinions from my own experience, but I do occasionally find myself distraught about the things that tend to catch wind in the technical news cycle.
The behavior is actually more complex than a 1/t schedule. It behaves like a linear decay schedule 1 - t/T with a fixed stopping time T, as if T had been chosen in advance to be the current timestep. When warmup is included, this is similar to high-performance triangular learning rate schedules.
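To spell out the simplest case (equal-weight averaging of plain SGD iterates, ignoring the interpolation step, so this is a sketch rather than the full algorithm): with z_t = z_0 - \gamma \sum_{s<t} g_s, the averaged iterate is

    x_T = \frac{1}{T} \sum_{t=1}^{T} z_t
        = z_0 - \frac{\gamma}{T} \sum_{s=0}^{T-1} (T - s)\, g_s
        = z_0 - \gamma \sum_{s=0}^{T-1} \left(1 - \frac{s}{T}\right) g_s,

so the gradient from step s carries effective weight \gamma (1 - s/T): a linear decay whose endpoint is always the current step T.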
Schedules of the form 1/t perform really poorly in practice; we actually did a large-scale comparison that included them in a prior paper: https://arxiv.org/pdf/2310.07831.pdf
My main current concerns: I asked for a transformer benchmark to see if this works on transformers, but didn't get any response. They also seem particularly focused on CNN-type benchmarks, yet did not benchmark against superconvergence + Ranger21 + the learning rate range finder, even though they explicitly said Schedule-Free needs tuning as well.
Their past research on D-Adaptation (won the ICML 2023 best paper award) and their follow-up work Prodigy both did worse than or similar to AdamW, so maybe this works on CNNs but not on transformers - but for CNNs we have superconvergence.
I shall wait for their paper which will come in 1-2 months.
And here I was hoping for something related to how to approach self-driven learning and education when you have a hectic and unpredictable schedule and are trying to fit learning in-between things with the fragments of time you have.
* Superconvergence + LR range finder + Fast AI's Ranger21 optimizer was the go-to recipe for CNNs and worked fabulously well, but on transformers the learning rate range finder said 1e-3 was best, whilst 1e-5 was actually better. However, the one-cycle learning rate schedule stuck (a minimal sketch of the one-cycle setup follows this list). https://github.com/huggingface/transformers/issues/16013
* I'm just a little bit reserved for now, since the authors themselves aren't providing any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the go-to standard for fast CNN training. Likewise, https://parameterfree.com/2023/08/30/yet-another-icml-award-... wasn't pleasant.
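For anyone unfamiliar with the superconvergence recipe mentioned above, a minimal PyTorch sketch (max_lr is what the LR range finder would normally suggest; model and train_loader are placeholders):

    import torch
    import torch.nn.functional as F
    from torch.optim.lr_scheduler import OneCycleLR

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = OneCycleLR(optimizer, max_lr=0.1,
                           epochs=10, steps_per_epoch=len(train_loader))

    for epoch in range(10):
        for x, y in train_loader:
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()
            scheduler.step()   # one-cycle steps once per batch, not per epoch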
Yeah, I saw the work from @Sree_Harsha_N, though that accuracy plot on the Adam/SGD side of things is very untuned - it's about what one could expect from an afternoon of working with it, but as far as baselines go, most people in the weeds with optimizers would recognize that it's not great for comparison (not to dump on the reproduction efforts).
Hence I think it might be hard to compare them accurately: SGD and Adam/AdamW likely have better potential top ends, but they're going to get more thrashed in public comparisons against an optimizer that seems to perform more flatly overall. Aaron works at FAIR, so I'm assuming he knows this. I reached out with some concerns on my end a little before he published the optimizer, but unfortunately didn't hear back either.
Yeah, it's been crazy to see how things have changed, and I'm really glad that there's still interest in optimizing things for these benchmarks. ;P Keller's pretty meticulous and has put in a lot of work for this from what I understand. I'm not sure where David's code came from originally, but it definitely impacted my code, as I referenced it heavily when writing mine, and Keller in turn rewrote a lot of my code with his style plus the improvements he made. Hopefully the pedigree of minimal code can continue as a tradition; it really has a surprising impact.
96 legitimately is pretty hard - I struggled doing it even in 2 minutes, so seeing it in 45 seconds is crazy. It definitely gets exponentially harder for every fraction of a percent, so I think that's a pretty big achievement to hit :D
Ye, this is a massive achievement indeed - I was quite astounded. I 100% will run this, and I wanna read up on the paper: https://arxiv.org/pdf/2404.00498.pdf
(an) author here: paper will likely be coming out in O(month). But, yes it turns out that the method is minimax optimal for stochastic convex optimization for a wide variety of parameter settings. Of course, minimax optimality alone does not fully explain empirical success - we've had minimax optimal algorithms for decades!
This would have been witty IMO: "the paper will be out in O(negative_peer_reviews)"
> (an) author here: paper will likely be coming out in O(month)
Ugh. I'm adding "O(month)" to my list of bootless metaphors.
Why? (1) Because in Big-O notation, O(month) would equal O(day), which is not the intended meaning in the comment above; (2) it is nonsensical: one would never say, e.g., "the run-time of an algorithm is O(seconds)" -- we write some kind of input inside the parens, not the output.
Anyhow, we already have the words roughly and about; e.g. "about a month".
I thought it was a clever/nerdy way to say in the worst case it will be out in a month. I imagine they have an internal review they have to get through first, and it's not clear if that will be done next week or in May.
There are unstable cases where a static learning rate doesn't work: the solution starts wobbling too much after some time and explodes. Using a learning rate that is too small from the beginning leads to local minima. Making it stable _is_ possible, but that's a different story.
There's a particular parameter (epsilon) in Adam which is typically set to a bad default, causing instability when the gradient gets sufficiently small. It is far easier to set epsilon to 0.001 or so than to muck around with learning rate schedules...
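In PyTorch that's a one-line change (1e-3 here is the ballpark value mentioned above, not a universal recommendation):

    # default eps is 1e-8; the update is roughly lr * m_hat / (sqrt(v_hat) + eps),
    # so a larger eps caps the step size when sqrt(v_hat) gets very small
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), eps=1e-3)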
"the bigger you make epsilon "... " thus slower the training progress will be"
Sounds like a variable epsilon would be optimal, either instead of a learning rate schedule or together with one. It would be nice if this could somehow be regulated algorithmically in a generic way.
The training slowdown is not really a problem... There's a pretty wide range of robust, good-enough values that don't slow things down much at all. As with all optimizer cruft, the 'optimal' value is going to be problem-dependent and a pain in the butt to actually find. So it's best to find a good-enough value that works in most contexts and not worry about it.
There used to be a note about it in the TensorFlow docs; they're keeping the bad default because it's the default, and changing it would unexpectedly change behavior for lots of users.