Human-level control through deep reinforcement learning

erostrate · on Feb 25, 2015

The code is online if you want to play with it. https://sites.google.com/a/deepmind.com/dqn/

If you're interested, one of the main authors (David Silver) teaches a very good and intuitive introductory class on reinforcement learning at UCL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

_ntka · on Feb 25, 2015

Interesting that they're using Torch7. The code is pretty concise and readable, very cool stuff.

yablak · on Feb 25, 2015

They wrote Torch7...

_ntka · on Feb 26, 2015

I'm sure a number of people who have contributed to Torch are working at DeepMind. However, Torch has been around for much longer than DeepMind (about 12 years at this point). Two of the major contributors to Torch, Ronan Collobert and Clement Farabet, were never DeepMind employees.

To be fair, another major contributor to Torch is a co-author of this paper (Kavukcuoglu).

bmh100 · on Feb 25, 2015

> ...the authors used the same algorithm, network architecture, and hyperparameters on each game...

This is huge. It shows that the algorithm was able to generalize across multiple problem sets within the same ___domain of "playing Atari 2600 games", and not simply a "lucky" choice of algorithm, network architecture, or hyperparameters that a random search for each game might choose. This is also not a violation of the No Free Lunch (NFL) Theorem [1], because the ___domain is limited to playing Atari 2600 games, which share many characteristics.

[1]: https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...

chriswarbo · on Feb 26, 2015

The No Free Lunch Theorems don't apply in reality anyway, as explained at https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...

Essentially: NFL says the performance (1 / number of trials are required to find a solution), when averaged over all fitness functions, will be the same for all optimisation algorithms. However, most fitness functions require more computational power than the Universe to evaluate; making even a single trial impossible to perform.

Since those problems are already impossible, it doesn't matter how many trials our optimisation algorithm would require when tackling them. We can use this unobservable bad performance to offset some observable good performance, to remain within the average dictated by NFL whilst still becoming objectively better.

In fact, one of this paper's authors (Shane Legg) has used this fact to define a general intelligence test which favours performance on simple functions http://arxiv.org/abs/0712.3329

bsdetector · on Feb 26, 2015

> the algorithm was able to generalize across multiple problem sets

Did it really? I think they reset it and retrained it for each game.

I'd like to know how much more is needed to make one instance of the AI that can successfully play any of the games. To play all 49 games that it could learn, does it need to be an extra level deep? Or 49 times larger? Or 2^49 times more?

DavidSJ · on Feb 26, 2015

What they mean by "the algorithm" is the learning algorithm, not the actual learned behavior. Historically in machine learning a great deal of hand tuning of learning algorithms is required for them to even be effective on a single problem, so it's impressive that a single algorithm was able to learn to play many different games effectively.

Having a single instance of an algorithm learn to perform well on many different tasks is really a separate problem than what they're dealing with here.

ZeroFries · on Feb 26, 2015

I think you could establish an upperbound at 50 times larger, because each set of "neurons" could be completely independent for each game, with 49 neurons to determine which set to use.

If they have similar features (which they undoubtedly would), the size could drop a lot.

Of course, I have no training in neural nets, and so my conclusions are reached from general understanding/reasoning.

meric · on Feb 26, 2015

Can one human successfully play all of the games without prior practice in each game? As far as I know a human has to practice almost every game to be able to play all of them without losses. I think for an AI, achieving this standard is a good result - first practicing in each game and then play through all of them without losses.

bsdetector · on Feb 26, 2015

A person that learns 1 game will learn the next game much faster, because they have learned a concept such as a bullet or switch or reflection or wrapping. We take this for granted, but there was a time when Breakout was actually marginally fun because it was new.

A person that's played all the other games in the list can win Montezuma's Revenge on the first try; this AI can't play Montezuma's Revenge at all.

sjtrny · on Feb 25, 2015

Watch it play:

http://www.nature.com/nature/journal/v518/n7540/extref/natur...

dwaltrip · on Feb 26, 2015

This is so cool. I'd love to work on this stuff...

Anyone know how hard it would be for someone who is fairly good at programming (works as a full stack developer and feels quite comfortable learning new things) and has strong math skills (undergrad degree) to break into this field? Is going back to school for a masters/phd the best way?

ratsimihah · on Feb 26, 2015

How about you read the article and get a few books on the relevant topics? It would probably be much cheaper than going back to school.

dwaltrip · on Feb 26, 2015

Very good point. I taught myself web dev (now working at a pretty awesome startup) so I'm definitely familiar with that route.

I have a few cool AI ideas I'm hoping to start spending more time in the coming months, and I have heard of some great online courses to check out. I was just curious as to how important institutional credentials are for this kind of thing, seeing as it much more academic than building CRUD web apps.

meric · on Feb 26, 2015

I think there are lots of AI competitions where you can join and make your name.

dabeeeenster · on Feb 26, 2015

I did an M.SC in CogSci a long way back (1998) and funnily enough my thesis area was very very close to this (reinforcement learning with different network topologies).

The core area is pretty simple stuff to be honest. Obviously DeepMind are completely next-level but you can get pretty good results with basic understanding...

superfx · on Feb 25, 2015

Here's a publicly-accessible link to the full paper:

http://rdcu.be/cdlg

jmnhr · on Feb 25, 2015

Only the first page, the rest is blurred and has to be paid for.

frandroid · on Feb 25, 2015

Let it load...

jmnhr · on Feb 25, 2015

They are using tricks. When I tried the first time the pdf was blurred and the page automatically opened the payment menu. On the second try it showed the entire pdf, disallowing download.

leereeves · on Feb 26, 2015

It loaded completely for me the first time, but doesn't allow download.

A step toward open access, but they're still trying to claim copyright and control the work of others.

j_m_b · on Feb 25, 2015

It is interesting how they are using various biological models to develop their own model. They gave their model a reward system and a memory. It will be interesting to see how far deep Q-networks can be extended and at what point they hit the wall of diminishing returns.

|Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents including DQN.

|Notably, the succesfsful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm involving the storage and representations of recently experienced transitions.

I am not for sure what data the replay algorithm has access to, but I wonder what happens if you extend the amount of data it has. This might be the brick wall this algorithm hits of diminishing returns.

It would be interesting to hear what the authors think could help help improve how their model deals with temporally extended planning strategies.

As someone who grew up on Atari, Nintendo and Sony this is pretty cool work.

morenoh149 · on Feb 26, 2015

I expect it could go far. Mind you I only did parts of Artificial Intellegence: A modern approach but the Q-learning algorthim seems very flexible. https://en.wikipedia.org/wiki/Q-learning It basically keeps doing good stuff, while exploring to get out of local minimas.

albertzeyer · on Feb 26, 2015

An interesting critic by Schmidhuber about this publication:

https://plus.google.com/100849856540000067209/posts/eLQf4KC9...

arvinjoar · on Feb 26, 2015

Seems to be critiquing the claim that this is "groundbreaking" and not much else. Nice to get some further context though. :)

nl · on Feb 25, 2015

Is this a different paper to the original DeepMind video game paper? http://arxiv.org/abs/1312.5602

sp332 · on Feb 25, 2015

Yes, I can't access the full paper but at least the figures are different :)

Edit: Ars Technica has a summary of this new paper. https://arstechnica.com/science/2015/02/ai-masters-49-atari-...

devindotcom · on Feb 26, 2015

Yes, they addressed this on the conference call I was on. That was a sort of 'initial results' they wanted to get out there, this is the full research. Same concept, explored more deeply.

Houshalter · on Feb 25, 2015

They claim good results on 49 games while the original paper only worked on a few IIRC.

discardorama · on Feb 25, 2015

Is there a chance this paper will be available as PDF? I' finding it difficult to read the readcube version. :-(

p1esk · on Feb 26, 2015

http://www.nature.com/nature/journal/v518/n7540/pdf/nature14...

teraflop · on Feb 26, 2015

Maybe you have institutional access or something, but for the rest of us, that link just redirects to the abstract.

javierluraschi · on Feb 26, 2015

I think qlearning is really interesting, I posted yesterday a simple implementation/demo in Javascript of qlearning. This paper goes way beyond qlearning by deducing states based on a deep neural network from the actual game rendering, really cool. Regardless, as a first intro to qlearning I had fun putting this together https://news.ycombinator.com/item?id=9105818

javierluraschi · on Feb 26, 2015

Here is the marketing side of this publication in which Google scientists (aquihired from Deepmind) have developed a way to outperform humans in Atari games: http://m.phys.org/news/2015-02-hal-bests-humans-space-invade...

plinkplonk · on Feb 26, 2015

Is the paper available anywhere to read without having to pay Nature? From the comments it seems as if everyone is able to read this but me! Even in their "readcube" access method, only the first page is (barely) visible, the rest seems blurred.

nl · on Feb 25, 2015

The most interesting thing about this is that it shows significant progress towards goal-oriented AI. The fact this system is effectively learning what "win" means in the context of a game is something of a breakthrough.

eveningcoffee · on Feb 25, 2015

I do not think that it figures out "what the win is" as the score parameter is explicitly made available to the algorithm.

In some sense this paper even demonstrates that how simple the problem actually is.

I think more important question is that what else can be modelled as such problem.

Houshalter · on Feb 25, 2015

Reinforcement learning is not new. The same algorithm was used in the 90s to beat backgammon. The only notable thing is they are using raw pixels and vision to play the games.

craftit · on Feb 26, 2015

It is an amazingly powerful technique. We've been working on a service which lets you do this kind of learning with any JSON stream. You can see a demo here:

https://aiseedo.com/demos/cookiemonster/

ya3r · on Feb 26, 2015

The amazing part of what DeepMind has achieved is its capability to learn from raw pixel input with deep convolutional neural networks, which as I understand it quite different from what you do.

Still the reinforcement learning part is the same, but reinforcement learning was not the main contribution of this nature paper.

craftit · on Feb 26, 2015

Its not all that different, we take multiple asynchronous streams of messages integrate them into a coherent predictive model, and use that to feed the reinforcement learning. The messages can contain images, a simple case can be seen in the demo with a 1d vision sensor.

viggity · on Feb 25, 2015

Can someone convert "academia nerd language" down one notch into "regular nerd language". On the surface, this sounds interesting but despite being a huge nerd I'm not really sure what the hell they're talking about.

discardorama · on Feb 25, 2015

They built a system that automatically learned to play 49 different games, without any human tweaking (aside from the original setup).

brentjanderson · on Feb 25, 2015

Here is a very rough translation from my POV:

> The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment.

Reinforcement (rewards/punishments) is a highly effective way to train autonomous individuals to succeed in arbitrary environments.

> To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations.

This kind of open-ended learning in a simulation is hard: Think of the number of inputs from all your senses being continually processed by the nervous system at a given time. Being able to take all those inputs, each of which changes meaning depending on context, to figure out what to do right now (while learning from the past in the process) is a hard problem, especially for a computer to solve.

> Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms.

Humans and other animals have this figured out through our own brains ("dopaminergic neurons") and the combination of our senses and various parts of our nervous systems/biology (e.g. our reflexes respond faster than our cognition due to the hierarchy of autonomic nervous system response).

> While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces.

Past reinforcement-based algorithms have worked well, but require thorough understanding of the problem being solved, or for the problem to be relatively simple and predictable.

> Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning.

By combining advances in deep neural network training with reinforcement learning in a "novel artificial agent" ("a deep Q-network"), our agent can learn sophisticated problems through only reinforcement learning.

> We tested this agent on the challenging ___domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters.

Our new approach works across 49 games using the same approach for each game (where each game presumably has different rules and dynamics), and is able to perform at the same level as a professional human being.

> This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

We've created a universal learning algorithm that can take a multitude of inputs and consistently respond correctly without having to re-define the model for each game or problem.

j_m_b · on Feb 25, 2015

|> This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

|We've created a universal learning algorithm that can take a multitude of inputs and consistently respond correctly without having to re-define the model for each game or problem.

Further along in the article, it turns out it is actually a subset of games which don't involve long term planning strategies.

chriswarbo · on Feb 26, 2015

> Further along in the article, it turns out it is actually a subset of games which don't involve long term planning strategies.

Unfortunately I can't access the paper, and the ReadCube link seems to require Flash, but I'm assuming a "Q-network" is based on Q-learning. Q-learning is essentially a lookup table of states to actions, where we guess what state we're in from our observations and look up which action to perform that will get us the most reward.

In that sense, it's clear that a Q-learning approach would struggle with long-term planning, since its memory only goes as far as 1 action. Of course there are ways to extend Q-learning, but these tend to destroy it's best feature: implementation efficiency.

One nice alternative I've seen in recent years is Gradient Temporal Difference, which allows linear functions to be learned rather than just single actions, and retains lots of the performance properties of Q-learning (O(n) in the number of functions learned, off-policy learning, etc.).

DavidSJ · on March 2, 2015

Q-learning is not a lookup table from states to actions. It's a learned mapping (using any method you want, such as neural networks) from agent history (i.e. the agent is allowed to consider the past) & action pairs, to estimates of future reward, learned through temporal differences between the prediction at time step t, and the sum of new prediction plus immediate reward at time step t+1.

In this way, information can flow back in time as the agent learns that some observation is predictive of reward, and then learns some observation is predictive of that observation, and so on until it connects the reward with some action K time steps ago. So long-term planning is definitely possible.

This does require that some relevant information be available at each intermediate time step to connect the actions with the ultimate reward. The nice thing about these Atari games is you can usually judge value just by what's immediately on the screen, and in this paper they only use the last three frames' worth of state. In a game requiring a memory for past sensory information (e.g. where information appears on the screen then disappears), this might not do so well, but that's more a matter of working memory than long-term planning, and a different Q-learning system could contain a working memory (e.g. if it was RNN-based).

sharemywin · on Feb 26, 2015

PDF:

http://arxiv.org/pdf/1312.5602v1.pdf

p1esk · on Feb 26, 2015

No, this is a new paper.

eveningcoffee · on Feb 25, 2015

I am wondering that what kind of real life problems could be modelled this way.

craftit · on Feb 26, 2015

All sorts, we have a machine learning service which lets you do something similar with JSON messages.

https://aiseedo.com/demos/cookiemonster/

phkahler · on Feb 26, 2015

>> I am wondering that what kind of real life problems could be modelled this way.

Driving a car? Making a taco? Working the checkout counter?

craftit · on Feb 26, 2015

Keeping someone entertained in a game, the longer they play the more reward you give it. I was also thinking of hooking up a face recognition system, so it can be rewarded when the user smiles.

eveningcoffee · on Feb 26, 2015

So could you define an cost function for these activities?

craftit · on Feb 26, 2015

I suspect setting up motivations for the AI is going to be a big research issue before too long. If you can write a simulator for the task you want to it solve, you should be able to train it. Often writing the simulator is much easier than solving a task itself. For example, the atlas robot, then can simulate it but struggle to control it.

Someone · on Feb 26, 2015

For comparison: http://www.cs.cmu.edu/~tom7/mario/. That is way more of a hack, but I am not sure this is that big a step forward. Space invaders and breakout aren't the hardest games and I haven't heard a hard argument that it is just a matter of scale to create a machine that, say, plays chess.

TheEzEzz · on Feb 26, 2015

The biggest differences are this:

1. The Mario algo has direct access to the game state, and will only work for games where it has that game state access. The DeepMind algo plays directly from the screen pixels. That means DeepMind has to first learn to interpret time varying (!) visual information correctly, then deduce rules and good play strategy on top of that leaky abstraction. That's hard. It also means the algorithm can be applied to any game with a screen output, not just to an Atari.

2. The Mario algo is doing a direct search through move space. It can back up and explore a different branch of the tree and play differently to see a different outcome. When the DeepMind algo plays Atari, it can't undo a move that it just did. It has to make good choices, using intelligence, just like a human player would.

The impressive thing here is not that it plays Atari games. You're right, we have had AIs that can do this for a long time, even better than this. The impressive thing is that it's a single AI algorithm that works for many games, and that is learning directly from the screen. We have not had anything like this before.

chriswarbo · on Feb 26, 2015

That's pretty cool (especially the ending of Tetris!), but the crucial difference is that it's trying to learn by observing what happens during a known-good run, then it sounds like some game-specific learning is performed. In contrast, the Arcade Learning Environment (the benchmark used in TFA) is completely autonomous; there is no human guidance or game-specific learning: anything it learns must work well across all of the games (ie. it's either generally useful, or it gets surrounded by a bunch of conditionals).

In particular, the author of your link says he's surprised that the algorithm can sometimes get further than twice the length of his pre-recorded training sequence.