I did a quick comparison on MNIST with a small ConvNet, pitting this AdamWScheduleFree optimizer against a few other optimizers (RAdam, NAdam, AdamW, SGD, Adam, Adafactor, SophiaG). The validation accuracy seems okay and the train loss decreases remarkably quickly.
Validation accuracy: https://i.imgur.com/8ZtX7Rd.png
Train loss: https://i.imgur.com/o5XdQ29.png
Code: https://bpa.st/NVJQ (currently only runs on my computer; I haven't had time to clean it up)
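For reference, the gist of how I'm calling it (going from the schedulefree package README as I remember it; the model and data loader names are placeholders and the exact arguments may differ):

    import torch
    import torch.nn.functional as F
    import schedulefree

    model = SmallConvNet()                 # placeholder for my MNIST ConvNet
    optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=2.5e-3)

    for epoch in range(10):
        model.train()
        optimizer.train()                  # the optimizer itself has train/eval modes
        for x, y in train_loader:          # placeholder DataLoader
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()

        model.eval()
        optimizer.eval()                   # switch to the averaged weights before evaluating
        # ... compute validation accuracy here ...

The optimizer.train()/optimizer.eval() switch matters because evaluation is supposed to run on the averaged iterate.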
Note that this is just a toy benchmark with very little hyperparameter tuning. You could probably get similar results with most optimizers and an appropriate schedule. Nevertheless, I appreciate every hyperparameter that I do not have to set manually.
In summary, this seems to be a promising optimizer. I'll add it to my list of optimizers to try for new deep learning projects.
> Can you share the list of your go to optimizers outside of the Adam family?
Sure! It depends a bit on what I'm doing.
If I want to optimize someone else's model, I start with Adam, because that's most likely what the hyperparameters have been tuned for. Once I've verified that Adam works, I'll try other optimizers.
If I have very few parameters and don't care about overfitting, I try LBFGS, which usually gets to a local optimum the fastest. Note that it will likely find a sharp local optimum; for better generalization performance you often prefer a wide optimum, so the model still works if there is a bit of drift in the data.
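Roughly what that looks like in PyTorch (the objective here is a stand-in; LBFGS wants the full-batch loss inside a closure because it may re-evaluate it several times per step):

    import torch

    params = torch.randn(10, requires_grad=True)   # a handful of parameters
    optimizer = torch.optim.LBFGS([params], lr=1.0, max_iter=20,
                                  line_search_fn="strong_wolfe")

    def closure():
        optimizer.zero_grad()
        loss = full_batch_loss(params)   # placeholder objective
        loss.backward()
        return loss

    for _ in range(50):
        optimizer.step(closure)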
If I do not want to mess around with learning rates, I use Adafactor, which is a bit slower, but usually works okay without any tuning.
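A sketch of the learning-rate-free mode, using the Adafactor implementation from HF transformers (I believe these are the relevant flags, but double-check the docs):

    from transformers.optimization import Adafactor

    # lr=None + relative_step=True lets Adafactor pick its own step size
    optimizer = Adafactor(
        model.parameters(),
        lr=None,
        scale_parameter=True,
        relative_step=True,
        warmup_init=True,
    )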
If I had very little memory available, I'd use SGD, but in my opinion it's not worth the hassle of tuning learning rate, momentum, dampening and weight decay. I'd rather use a smaller model if possible.
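For reference, those are exactly the knobs in the constructor (values below are just common starting points, not recommendations):

    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,             # these four interact, which is the tuning hassle
        momentum=0.9,
        dampening=0.0,
        weight_decay=1e-4,
    )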
I usually do not train with extremely large batch sizes, but if I did, I'd try the optimizers which claim to work well for large batch sizes.
All in all, it probably does not matter too much which optimizer you are using, as long as you tuned it a little bit. Same goes for the model, loss functions, activation functions and all that other fluff.
What /is/ important is that you design your problem in such a way that it is as easy as possible to solve. For example, it is very difficult to read arbitrary hand-written text from an image. If you have control over where the data comes from, it would be better to write the text character by character into a printed grid with additional optical markers for image registration. Or even better, replace it with a multiple choice list. If there are not too many exceptional cases, an "other" option for manual review could be added. Often, automating 99 % of the work is more than good enough and it is better to keep a human in the loop to handle edge cases.
Secondly, control the data capture as strictly as possible. For example, use uniform lighting, place the object to recognize at exactly the same position, exclude disruptive elements, etc.
Lastly, data is king. If your training data does not match the test data, you can train all you want and still get garbage results. Either collect enough training data to cover all test cases or, if that is not possible from the start, retrain with new data regularly. Data augmentation might help to some degree, but it is impossible to predict everything.
This is a pretty hyped-up optimizer that seems to have okay-ish performance in practice, but there are a number of major red flags here. For one, the baselines are decently sandbagged, yet the Twitter posts sharing them (which are pretty hype-y) directly say that the baselines are "highly tuned" and that there's no benchmark trickery, which is flat-out wrong. If someone has no experience with said benchmarks, that is a plausible statement, but having worked with some of these datasets very closely, I can say some of the baselines are simply terrible; I don't know where they came from.
Additionally, the optimizer does actually appear to have a kind of momentum, despite claims directly stating the contrary, and uses it with a Nesterov-like step (line 2 of 3 in the inner loop). Finally, it is 'schedule-free' because the schedule is actually hardcoded into the algorithm itself: 1./steps_taken, which is hardly a rare learning rate schedule. It is a decently robust but sometimes suboptimal schedule, and I find it sketchy to claim that it is 'schedule-free'. This also cripples the optimizer by tying performance to the number of steps taken -- which is potentially a problem if you are using any batch-size + LR scaling strategies, as I understand it.
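For the curious, this is my reading of the three-line inner loop being discussed (simplified to the SGD case and written schematically, so treat it as a sketch of the posted code rather than the reference implementation; grad() is a placeholder for the stochastic gradient):

    # x: averaged iterate (what gets evaluated), z: base iterate
    # beta ~ 0.9, gamma: learning rate
    for t in range(1, num_steps + 1):
        y = (1 - beta) * z + beta * x    # interpolation point -- the momentum-like part
        z = z - gamma * grad(y)          # gradient step taken at y (Nesterov-flavored)
        c = 1.0 / t                      # the hardcoded 1/steps_taken weighting
        x = (1 - c) * x + c * z          # running average of the z iterates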
There is a mixture of hype and substance here, and I wish the author were more straightforward with their approach and claims. I think there is the potential for a good "bolts-included" optimizer in some of the ideas being presented here, but the amount of overhyping and deception makes me not want to trust any of the follow-up work.
Unfortunately, hype is what sells best on Twitter, and some of the claims being made here appear to be at the very best deceptive, and at the very worst, untrue. I could be wrong -- these are just my personal opinions from my own experience, but I do occasionally find myself distraught about the things that tend to catch wind in the technical news cycle.
The behavior is actually more complex than a 1/t schedule. It behaves like a linear decay schedule 1 - t/T with a fixed stopping time T, as if T had been chosen in advance to be the current timestep. When warmup is included, this is similar to high-performance triangular learning rate schedules.
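To spell out the simplest case (equal-weight averaging of plain SGD iterates, ignoring the interpolation step, so this is a sketch rather than the full algorithm): with z_t = z_0 - \gamma \sum_{s<t} g_s, the averaged iterate is

    x_T = \frac{1}{T} \sum_{t=1}^{T} z_t
        = z_0 - \frac{\gamma}{T} \sum_{s=0}^{T-1} (T - s)\, g_s
        = z_0 - \gamma \sum_{s=0}^{T-1} \left(1 - \frac{s}{T}\right) g_s,

so the gradient from step s carries effective weight \gamma (1 - s/T): a linear decay whose endpoint is always the current step T.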
Schedules of the form 1/t perform really poorly in practice; we actually did a large-scale comparison that included them in a prior paper: https://arxiv.org/pdf/2310.07831.pdf
My main current concerns: I asked for a transformer benchmark to see if this works on transformers, but didn't get any response. They also seem particularly focused on CNN-type benchmarks, yet did not benchmark against superconvergence + Ranger21 + the learning rate range finder, even though they explicitly said Schedule-Free needs tuning as well.
Their past research on D-Adaptation (won the ICML 2023 best paper award) and their follow-up work Prodigy both did worse than or similar to AdamW, so maybe this works on CNNs but not on transformers - but for CNNs we have superconvergence.
I shall wait for their paper which will come in 1-2 months.
And here I was hoping for something related to how to approach self-driven learning and education when you have a hectic and unpredictable schedule and are trying to fit learning in-between things with the fragments of time you have.
* Superconvergence + LR range finder + Fast AI's Ranger21 optimizer was the go-to recipe for CNNs and worked fabulously well, but on transformers the learning rate range finder said 1e-3 was best, whilst 1e-5 was actually better. However, the one-cycle learning rate schedule stuck (a minimal sketch of the one-cycle setup follows this list). https://github.com/huggingface/transformers/issues/16013
* I'm just a little bit reserved for now, since the authors themselves aren't providing any transformer benchmarks, nor have they compared their CNN baselines to superconvergence, which is the go-to standard for fast CNN training. Likewise, https://parameterfree.com/2023/08/30/yet-another-icml-award-... wasn't pleasant.
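For anyone unfamiliar with the superconvergence recipe mentioned above, a minimal PyTorch sketch (max_lr is what the LR range finder would normally suggest; model and train_loader are placeholders):

    import torch
    import torch.nn.functional as F
    from torch.optim.lr_scheduler import OneCycleLR

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = OneCycleLR(optimizer, max_lr=0.1,
                           epochs=10, steps_per_epoch=len(train_loader))

    for epoch in range(10):
        for x, y in train_loader:
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()
            scheduler.step()   # one-cycle steps once per batch, not per epoch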
Yeah, I saw the work from @Sree_Harsha_N, though that accuracy plot on the Adam/SGD side of things is very untuned - it's about what one could expect from an afternoon of working with it, but as far as baselines go, most people in the weeds with optimizers would recognize that it's not great for comparison (not to dump on the reproduction efforts).
Hence I think it might be hard to compare them accurately: SGD and Adam/AdamW likely have better potential top ends, but they're going to get more thrashed in public comparisons against an optimizer that seems to perform more flatly overall. Aaron works at FAIR, so I'm assuming he knows this. I reached out with some concerns on my end a little before he published the optimizer, but unfortunately didn't hear back either.
Yeah, it's been crazy to see how things have changed, and I'm really glad that there's still interest in optimizing things for these benchmarks. ;P Keller's pretty meticulous and has put in a lot of work for this from what I understand. I'm not sure where David's code came from originally, but it definitely impacted my code, as I referenced it heavily when writing mine, and Keller in turn rewrote a lot of my code with his style plus the improvements he made. Hopefully the pedigree of minimal code can continue as a tradition; it really has a surprising impact.
96 legitimately is pretty hard - I struggled doing it even in 2 minutes, so seeing it in 45 seconds is crazy. It definitely gets exponentially harder for every fraction of a percent, so I think that's a pretty big achievement to hit :D
Ye, this is a massive achievement indeed - I was quite astounded. I 100% will run this, and I wanna read up on the paper: https://arxiv.org/pdf/2404.00498.pdf
(an) author here: paper will likely be coming out in O(month). But, yes it turns out that the method is minimax optimal for stochastic convex optimization for a wide variety of parameter settings. Of course, minimax optimality alone does not fully explain empirical success - we've had minimax optimal algorithms for decades!
This would have been witty IMO: "the paper will be out in O(negative_peer_reviews)"
> (an) author here: paper will likely be coming out in O(month)
Ugh. I'm adding "O(month)" to my list of bootless metaphors.
Why? (1) Because in Big-O notation, O(month) would equal O(day), which is not the intended meaning in the comment above; (2) it is nonsensical: one would never say, e.g., "the run-time of an algorithm is O(seconds)" -- we write some kind of input inside the parens, not the output.
Anyhow, we already have the words roughly and about; e.g. "about a month".
I thought it was a clever/nerdy way to say in the worst case it will be out in a month. I imagine they have an internal review they have to get through first, and it's not clear if that will be done next week or in May.
There are unstable cases where a static learning rate doesn't work: the solution starts wobbling too much after some time and explodes. Using a learning rate that is too small from the beginning leads to local minima. Making it stable _is_ possible, but that's a different story.
There's a particular parameter (epsilon) in Adam which is typically set to a bad default, causing instability when the gradient gets sufficiently small. It is far easier to set epsilon to 0.001 or so than to muck around with learning rate schedules...
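In PyTorch that's a one-line change (1e-3 here is the ballpark value mentioned above, not a universal recommendation):

    # default eps is 1e-8; the update is roughly lr * m_hat / (sqrt(v_hat) + eps),
    # so a larger eps caps the step size when sqrt(v_hat) gets very small
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), eps=1e-3)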
"the bigger you make epsilon "... " thus slower the training progress will be"
Sounds like a variable epsilon would be optimal, either instead of a learning rate schedule or together with one. It would be nice if this could somehow be regulated algorithmically in a generic way.
The training slowdown is not really a problem... There's a pretty wide range of robust, good-enough values that don't slow things down much at all. As with all optimizer cruft, the 'optimal' value is going to be problem-dependent and a pain in the butt to actually find. So it's best to find a good-enough value that works in most contexts and not worry about it.
There used to be a note about it in the TensorFlow docs; they're keeping the bad default because it's the default, and changing it would unexpectedly change behavior for lots of users.