
Everyone’s experience is different, but I’ve been in dozens of MLE interviews (some of which I passed!) and have never once been asked to explain the internals of an optimizer. The interviews were all post-2020, though.

Unless someone had a very good reason, I would consider it weird to use anything other than AdamW. The compute you could save with a slightly better optimizer pales in comparison to the time you will spend debugging an opaque training bug.
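
For reference, a minimal sketch of what that default choice looks like in PyTorch; the model, learning rate, and weight decay here are placeholder assumptions, not a recommendation:

    import torch
    from torch import nn

    # Toy model purely for illustration; swap in your own.
    model = nn.Linear(128, 10)

    # AdamW with roughly typical settings; lr and weight_decay are
    # placeholder values.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=3e-4, weight_decay=0.01
    )

    for step in range(100):
        x = torch.randn(32, 128)
        y = torch.randint(0, 10, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()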




For example, if it is meaningful to use large batch sizes, the gradient variance will be lower, and Adam can end up behaving much like plain momentum.
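
A quick way to see the batch-size effect (a toy numpy sketch, treating per-example gradients as i.i.d. noisy samples around a fixed "true" gradient, with made-up numbers):

    import numpy as np

    rng = np.random.default_rng(0)
    true_grad = 1.0   # pretend scalar "true" gradient
    noise_std = 5.0   # per-example gradient noise (assumed)

    for batch_size in (8, 64, 512):
        # Mini-batch gradient = mean of noisy per-example gradients.
        samples = true_grad + noise_std * rng.standard_normal((10000, batch_size))
        estimates = samples.mean(axis=1)
        # Variance of the estimate shrinks roughly as 1 / batch_size.
        print(batch_size, estimates.var())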

As a model is trained, the gradient variance typically falls.

Those optimizers all work to reduce the variance of the updates in various ways.
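
One concrete illustration of that variance reduction (a toy sketch, not tied to any particular optimizer implementation): the momentum-style exponential moving average used in SGD+momentum and in Adam's first moment already smooths out much of the per-step gradient noise.

    import numpy as np

    rng = np.random.default_rng(0)
    # Noisy gradient stream around a mean of 1.0 (made-up numbers).
    noisy_grads = 1.0 + 5.0 * rng.standard_normal(10000)

    beta = 0.9  # typical first-moment decay
    ema = 0.0
    smoothed = []
    for g in noisy_grads:
        # Momentum / first-moment EMA of the gradients.
        ema = beta * ema + (1 - beta) * g
        smoothed.append(ema)

    print("raw update variance:     ", noisy_grads.var())
    print("smoothed update variance:", np.var(smoothed[100:]))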



