Deep Reinforcement Learning (deepmind.com)
146 points by dcre on June 18, 2016 | 29 comments



"Previous attempts to combine RL with neural networks had largely failed due to unstable learning. To address these instabilities, our Deep Q-Networks (DQN) algorithm stores all of the agent's experiences and then randomly samples and replays these experiences to provide diverse and decorrelated training data."

... so, made the machines dream. Fancy!
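For reference, the "stores all of the agent's experiences and then randomly samples and replays" part of the quote is typically implemented as a replay buffer along these lines (a minimal Python sketch; the class and method names are illustrative, not DeepMind's code):

    # Minimal sketch of an experience replay buffer (illustrative only):
    # store transitions, then sample random minibatches so consecutive,
    # highly correlated frames don't dominate each update.
    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)   # oldest experiences fall off

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Uniform random sampling decorrelates the training data.
            indices = random.sample(range(len(self.buffer)), batch_size)
            return [self.buffer[i] for i in indices]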


For historical background on this part of the algorithm, called "experience replay", see this paper from Long-Ji Lin in 1992:

Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching

http://link.springer.com/article/10.1023/A:1022628806385

as well as his excellent 1993 PhD thesis:

Reinforcement Learning for Robots Using Neural Networks

http://www.dtic.mil/dtic/tr/fulltext/u2/a261434.pdf


Aha, so that's what (human) dreaming is for.


IIRC there is evidence rats replay their experiences, sped up, while they are asleep. Dreaming may be something else entirely though, because my dreams aren't anything like my memories of the day before.

Artificial neural networks can "dream" by predicting which frame they will see next. This is a really cool technique. They've shown slightly blurry videos of Atari games being played that come entirely from the network's "dream", with no interaction with the game at all. You can even train the reinforcement learning on the dream sequences and improve its performance.

But this also doesn't seem quite like what human dreams are. Human dreams are wild and unrealistic, while the NN dreams try to match the training data as closely as possible.
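(A hypothetical sketch of what such "dreaming" amounts to: feed a learned next-frame predictor its own output in a loop. `model` and `policy` stand in for whatever frame predictor and action-selection policy were trained; they are not a real API.)

    # Hypothetical sketch: generate a "dream" rollout by feeding a learned
    # next-frame predictor its own predictions. `model` and `policy` are
    # placeholders for a trained frame predictor and an action-selection policy.
    def dream_rollout(model, policy, start_frame, length=100):
        frames, frame = [start_frame], start_frame
        for _ in range(length):
            action = policy(frame)                      # act on the imagined frame
            frame = model.predict_next(frame, action)   # never touches the real game
            frames.append(frame)
        return frames   # a slightly blurry "video" generated entirely by the network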


> IIRC there is evidence rats replay their experiences...

https://en.wikipedia.org/wiki/Hippocampal_replay

for anyone that's interested.


Quite a few of my dreams seem to be slightly scary situations (dangerous behavior of other drivers while I'm driving, for example).

This makes me think that we have a "scenario generation" engine as part of our brains, and so we're being presented with novel experiences (often "wild and unrealistic") to react to.

So this seems to be a pretty big step beyond the idea of presenting previously seen situations (which sounds like a really good first step, of course).


> You can even train the reinforcement learning on the dream sequences and improve its performance.

I'm not sure how that would work. Surely you'd be overfitting on your training set by definition?


Reinforcement learning has a problem in that it gets very little labelled data. You may have a million frames, but the only label is the score, which may only change a few times per game.

Training the net to predict the next frame is sort of unsupervised learning. It can learn the rules of the game without score information at all.

The second thing is that RL is different from prediction. Even if you can predict the next frame exactly, finding the optimal set of moves is still a hard problem. The algorithm needs to learn not just to predict what will happen, but also what the optimal action is in every situation. That is something that can be practiced in simulations, or "dreams".
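A toy illustration of that last point (tabular Q-learning for brevity; DQN replaces the table with a neural network): the update below doesn't care whether the transition came from the real emulator or from an imagined rollout.

    # Toy tabular Q-learning update; the transition (s, a, r, s', done) can come
    # from real play or from a simulated/"dreamed" rollout, the rule is the same.
    from collections import defaultdict

    GAMMA, ALPHA = 0.99, 0.1
    ACTIONS = range(4)                     # e.g. a small discrete action set
    Q = defaultdict(float)                 # Q[(state, action)] -> value

    def q_update(state, action, reward, next_state, done):
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        target = reward + GAMMA * best_next
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])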


I have read somewhere that dreams are for training purposes, presenting the dreamer with situations from the past, the present, or a combination. That would explain why dreams are mostly dangerous or "wild": it's how our mind tries to figure out how to deal with those situations in the future. Good examples of this are the same recurring dreams over and over. I also think there is no sense of time in the brain while dreaming, so it does not differentiate childhood "situations" from any current ones. It just keeps fucking trying to solve that damn math exam problem...


I've dreamt up a few possible explanations for dreams.

Currently my favourite is that they're for converting short-term to long-term memory.

My most vivid dreams come when I experience something new, or have an unexpected reminder of something I last thought about years ago.

I imagine the surreal nature of the dreamscape is some consequence of developing neuronal paths:

storing a new idea about a boat forms a path to a neuronal pathway referencing fire, and your dream has you sailing across a sea of fire.


I thought it was for regularization.


The comparison to dreaming reminds me of a comment in David MacKay's Information Theory, Inference, and Learning Algorithms:

"One way of viewing the two terms in the gradient (43.9) is as 'waking' and 'sleeping' rules. While the network is 'awake', it measures the correlation between x_i and x_j in the real world, and weights are increased in proportion. While the network is 'asleep', it 'dreams' about the world using the generative model (43.4), and measures the correlations between x_i and x_j in the model world; these correlations determine a proportional decrease in the weights. If the second-order correlations in the dream world match the correlations in the real world, then the two terms balance and the weights do not change."


next up: make them dream of electric sheep


Let's hope Google doesn't feed it video games like Battlefield, where it learns how to most effectively kill humans.


When it's playing a game (e.g. breakout) and it's being fed the pixels on the screen, how is the AI being told what the score/progress is? Does it have access to some numeric metric that is chosen by the researchers for each game?


Yes.

For example, Breakout saves its score in RAM addresses 76 and 77. The Arcade Learning Environment has code to read the score, one routine per game. The code for Breakout is here: https://github.com/mgbellemare/Arcade-Learning-Environment/b...
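(Hedged sketch of what that looks like from the Python side of ALE; the `ale_py` import name varies by version, and the exact BCD decoding below is a paraphrase of the linked C++, so treat it as illustrative.)

    # Illustrative only: read the Atari 2600 RAM through ALE and decode
    # Breakout's score from bytes 76/77 (BCD digits, per the linked C++ code).
    from ale_py import ALEInterface   # package name differs in older ALE versions

    ale = ALEInterface()
    ale.loadROM("breakout.bin")       # path is illustrative
    ram = ale.getRAM()                # 128 bytes of console RAM

    lo, hi = int(ram[77]), int(ram[76])
    score = (lo & 0x0F) + 10 * ((lo >> 4) & 0x0F) + 100 * (hi & 0x0F)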


Thanks for that! So I was looking at the one for Montezuma's Revenge: https://github.com/mgbellemare/Arcade-Learning-Environment/b...

The reward seems to be only the game score, which I believe is known to be problematic for this game because your score doesn't go up very often (so you have to perform a lot of actions to get any feedback)? The lives are recorded but aren't part of the "get reward" method... are the lives factored into decision-making somewhere else? It seems like knowing you just lost a life would really help decision-making in such a game.


I think novel game situations were counted into the reward function as well


Only in the specialized, novelty-oriented DQN agents; the Montezuma's Revenge reward itself remains the same. The problem is defining 'novel' when every screen's pixels may be different (for example, imagine any game which has a timer ticking up).


Not so much: a timer ticking up is only novel the first time around, and is unrelated to the actions taken by the agent. Over multiple plays the agent will learn to ignore it.

EDIT: It could be that the agent will just stand there the first few plays around, enjoying the novelty reward gained from simply watching the timer tick up. Haha


The point is that every time the timer ticks, if you had defined 'novelty' as the bitstring representing the screen, you get a 'new' state. This multiplies against any blinking animations, any moving enemies, any of the agent's moves, any visible scores, etc. You get thousands or millions of unique framebuffer states before the agent has so much as left the first room in _Montezuma's Revenge_. And DQN is already RAM-intensive because of the experience replay buffer.


Thanks for the extra explanation. It seems I assumed too much about these Deep Q networks, due to some prior knowledge of the neuroscience related to RL. Although I do remember having seen a video about Montezuma's Revenge a week ago or so, where they talked about this exact problem.

Anyway, it seems to me that a novelty function that lets the agent ignore periodic changes in state, such as timers ticking up, could be quite simple. A function that estimates the novelty of individual bit values in the game-state bitstring and then aggregates them could quite easily account for timers, or more generally for elements that change periodically regardless of the agent's actions. A baseline novelty reward would be relatively easy for the agent to predict, and would thus produce low prediction errors and little reinforcement of actions. Such a function would have space and time complexity linear in the length of the game state; it's fairly naive and simple, but I think it would get the job done? Something like the sketch below.
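(A toy illustration of the idea, entirely hypothetical, and subject to the caveats in the reply below:)

    # Toy per-bit novelty bonus (hypothetical): track how often each bit of the
    # game-state bitstring has been 1, and reward states whose bits are unusual.
    # Space and time are linear in the number of bits.
    import numpy as np

    class BitNovelty:
        def __init__(self, n_bits):
            self.ones = np.ones(n_bits)    # smoothed count of times each bit was 1
            self.total = 2.0

        def bonus(self, bits):             # bits: 0/1 numpy array of the game state
            p_one = self.ones / self.total
            p_obs = np.where(bits == 1, p_one, 1.0 - p_one)
            self.ones += bits
            self.total += 1.0
            # A bit that flips constantly settles near p = 0.5 and contributes only
            # a constant, easily predicted amount; genuinely rare patterns score high.
            return float(np.mean(-np.log(p_obs + 1e-8)))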

P.S. Just wanted to thank you for the work you've put into your website; it's very informative and always a great starting point for diving deeper into the topics you cover!


You have to come up with something or else the agent will never be able to explore worth a damn in complex domains. Imagine trying to learn to write Haskell programs by typing random gibberish...

Using the 'gamestate' is illegal. It's pointless to suppose an agent which has access to the true ground-truth RAM of the Atari games, because that generalizes to vanishingly few other domains. The goal is to create a general agent which can be used elsewhere, such as in recommender systems. (And if you did have access to the raw RAM, that would reduce the problem from an extremely challenging POMDP or harder to a fully observed deterministic MDP, because you could then construct a game tree of each individual RAM state and the possible actions taken in it; in which case you would use a much faster and more powerful MDP solver like MCTS rather than DQN.)

One can come up with hand-crafted heuristics which might improve over the naive bitstring equality approach, but your suggestion still doesn't do the trick, assuming you could figure out how to meaningfully define 'periodic changes' and teach the NN to ignore them. Imagine a game in which the overall screen lighting varies (perhaps it's set at night or during rain, or perhaps each level has different color themes). As all the bits keep flipping with changes in lighting/intensity, you'd be in about the same place.


Yes, it's fed the score as the reward value. If I'm not wrong, they didn't normalise it across games for the initial paper, but normalised it to some range in some of the follow-up experiments.


The original papers [1][2] for the deep Q-network used reward clipping: "As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at -1, leaving 0 rewards unchanged".

[1] https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

[2] http://home.uchicago.edu/~arij/journalclub/papers/2015_Mnih_...
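In code, that clipping is just a sign function:

    import numpy as np

    def clip_reward(r):
        # Positive rewards become 1, negative become -1, zero stays 0,
        # as described in the DQN papers.
        return float(np.sign(r))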


They can measure the time you played. More time in the game means a higher score.


Labyrinth? I have a feeling that Doom is next.


You might find the VizDoom project interesting: http://vizdoom.cs.put.edu.pl/


> Labyrinth? I have a feeling that Doom is next.

This would be really interesting to see. I'm curious whether Google would avoid this game, though, since the media would likely have a field day reporting on a murderous Google-developed AI if they did. Technically we already have this in games anyway (e.g. bots in FPS games).



