Value-Based Deep RL Scales Predictably (arxiv.org)
68 points by bearseascape 87 days ago | 3 comments



My attempt at a summary: the authors characterize the data-compute Pareto front (in other words: how bitter is the lesson, exactly?)
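To make "data-compute Pareto front" concrete, here is a minimal sketch (my own illustration, not the paper's code) of extracting the Pareto-optimal budgets from hypothetical (data, compute) points that all reach the same target error:

    import numpy as np

    # Hypothetical (data, compute) budgets that each reach the same
    # target error; in the paper these come from fitted scaling curves.
    points = np.array([
        [1e6, 5e17],   # little data, lots of compute
        [3e6, 1e17],
        [1e7, 4e16],   # lots of data, little compute
        [5e6, 2e17],   # dominated by row 2 in both coordinates
    ])

    def pareto_front(pts):
        # Keep a point unless some other point uses no more data AND
        # no more compute, and strictly less of at least one.
        keep = []
        for i, p in enumerate(pts):
            dominated = any(np.all(q <= p) and np.any(q < p)
                            for j, q in enumerate(pts) if j != i)
            if not dominated:
                keep.append(p)
        return np.array(keep)

    print(pareto_front(points))  # prints the three non-dominated rows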

For a different perspective (error vs. compute), see

https://youtu.be/5eqRuVp65eY

and the comments there.

(I particularly liked the comment about string theorists rediscovering a fundamental theorem of GR decades too late. Rediscovering how to integrate happens in every field; it's nothing to be ashamed of :)


Skimmed bits of it: "on-policy" RL means the model generates output and receives feedback from some sort of dynamic environment, which may not be scalable. Value-based off-policy RL means the model is trained on data that was not generated by the model itself exploring a dynamic environment; it can instead be recordings. The authors then ask: how does that scale?
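Roughly, in code (a generic sketch of the two regimes, not the paper's setup; env, policy, and update are stand-ins):

    import random, collections

    Transition = collections.namedtuple(
        "Transition", "state action reward next_state")

    # On-policy: every update needs fresh interactions generated by
    # the current policy; the data is used once and thrown away.
    def on_policy_step(env, policy, update):
        state, done, trajectory = env.reset(), False, []
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append(
                Transition(state, action, reward, next_state))
            state = next_state
        update(trajectory)

    # Off-policy (value-based): updates sample from a buffer of
    # stored transitions, which can be recordings from any policy.
    buffer = collections.deque(maxlen=1_000_000)

    def off_policy_step(update, batch_size=256):
        update(random.sample(buffer, batch_size))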


RL is unbelievably finicky, sensitive to hyperparameters, and hard to make work. If you are pursuing a research project and decide to use RL, you are making your life a lot more difficult and stressful.

It's exciting to see any progress in making accurate predictions about what settings will work for RL training. I hope that this research direction can be expanded in scope and that ultimately, people who want to do research in RL can become confident in their training recipes.

I am more excited about that than about the dream of scaling compute per se.
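For a flavor of what that kind of prediction looks like, a toy sketch (assuming a plain power law, which is simpler than the functional forms the paper actually fits): fit a few cheap small-scale runs, then extrapolate to a larger budget before committing to it.

    import numpy as np

    # Hypothetical small-scale runs: compute budget vs. final error.
    compute = np.array([1e15, 3e15, 1e16, 3e16])
    error   = np.array([0.52, 0.41, 0.33, 0.26])

    # Assume error = a * compute**(-b) and fit in log-log space.
    slope, log_a = np.polyfit(np.log(compute), np.log(error), 1)
    a, b = np.exp(log_a), -slope

    # Predicted error at a ~30x larger budget than the biggest run.
    print(a * 1e18 ** (-b))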



