
Is there a standard way to build Markov chains with deeper memory than just the current state? For example, in the rainy/sunny Markov chain the rainy probability should (or at least could) decrease as the number of rainy days in a row increases. In a pure state machine this would require an infinite number of states to cover SS...SSS and RR...RRR.



Lots of other commenters are saying that the defining property of MCs is that they are memoryless, but in fact it's trivial to give an MC deeper memory: use tuples.

So if you have observations A, B, C, B, C, A, D, A, ... then you form pairs: (A,B), (B,C), (C,B), (B,C), (C,A), (A,D), (D,A), ...

Each node in the MC is one of these pairs, and (X,Y) -> (Y,Z) if Z is the new state that appears.

The number of states grows very quickly, but if you build the states lazily as they are observed then it works fine. This is what you use to randomly generate text using the previous 3 words as a predictor for the next: (w0,w1,w2) predicts w3, and your next "state" is (w1,w2,w3).

And so on.
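A minimal sketch of that tuple trick for text, building the tuple states lazily from whatever corpus you feed it (the placeholder corpus and k=3 below are just for illustration):

    import random
    from collections import defaultdict

    def build_chain(words, k=3):
        # Map each k-word tuple to the list of words that followed it.
        chain = defaultdict(list)
        for i in range(len(words) - k):
            chain[tuple(words[i:i + k])].append(words[i + k])
        return chain

    def generate(chain, length=50):
        state = random.choice(list(chain))  # start from some observed k-tuple
        out = list(state)
        for _ in range(length):
            followers = chain.get(state)
            if not followers:               # dead end: this tuple was never continued
                break
            nxt = random.choice(followers)
            out.append(nxt)
            state = state[1:] + (nxt,)      # (w0,w1,w2) -> (w1,w2,w3)
        return " ".join(out)

    corpus = "the rain in spain stays mainly in the plain".split()  # placeholder
    print(generate(build_chain(corpus, k=3)))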


As a confused undergrad first learning about random dynamical systems on my own (outside of any classes), this really bothered me. How could the Markov assumption be so powerful when it's clearly trivial to add memory and retain the Markov property?!

After reading through theorems and proofs for dynamical systems theory, and especially considering applications of that theory, I came up with basically two practical takeaways:

1. Complexity matters and the curse of dimensionality is real.

2. All those nice theorems (stability, convergence, splitting, invariance, bifurcations, ...) are stated with respect to a state space and make assumptions about the dynamics on that state space. So if you code up a state space representation that is history-aware with respect to the dynamics you really care about, then:

a) you might lose some theorems that hold for the original process, and

b) you sometimes realize the theorems lose a lot of power when they're telling you something about a tuple of things you care about instead of the actual thing you care about.

If I ever teach people about random dynamical systems, this explanation will come immediately after the definition of a Markov process and I'll continually remind my audience throughout the course. Understanding what goes wrong when a system is "technically" Markovian is probably the best way to obtain a deep intuition. Or at least, it was very helpful to me.


Exactly - even if your process is conceptually a Markov chain, your code doesn't have to explicitly enumerate all of the states.

I used a scheme similar to your example, but with letters, to generate words for http://password.supply


Isn't this like an "auto-regressive feed forward neural net"[0]?

> Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts y_t using only the k most recent inputs, x_{t-k+1}, …, x_t. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the k most recent inputs. Google’s WaveNet [1] nicely illustrates this general principle.

[0] http://bair.berkeley.edu/blog/2018/08/06/recurrent/

[1] https://arxiv.org/abs/1609.03499


Is this equivalent to each node representing a memory state?

Seems like it would not solve rtkwe's concern. In their example, wouldn't using tuples make a node's state equal to (S, COUNT_OF_DAYS) or (R, COUNT_OF_DAYS)? You would still need an infinite number of nodes to represent an unbounded integer count.


You don't store the count of days; each node is the sequence of the previous k states. Usually k=1, but set k=2 and you get something where each next state is predicted based on the previous 2 states.

So no, you don't need an infinite amount of memory; it's all still finite (although it rapidly gets large).


The definition of the Markov property is that the process is memoryless - the probability distribution of the next state depends only on the current state. This makes Markov chains an easy tool for mathematical analysis, but limits what they can model completely.

One simple way to add memory is to make the states more complex. For example, instead of recording the weather each day as a state, you could record the weather as two-day tuples. This allows more complex dynamics, but the number of states quickly increases.

Hierarchical Markov models are another way to achieve this: there is a transition matrix for the weather within each regime (winter vs. summer) and another matrix that governs transitions between the regimes themselves.
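A minimal sketch of that two-level idea, assuming two regimes and made-up transition probabilities (the regime names and numbers are illustrative, not from the comment):

    import random

    # Outer chain: slow transitions between regimes.
    regime_T = {"winter": {"winter": 0.99, "summer": 0.01},
                "summer": {"summer": 0.99, "winter": 0.01}}

    # Inner chains: one weather transition matrix per regime (R = rainy, S = sunny).
    weather_T = {"winter": {"R": {"R": 0.8, "S": 0.2}, "S": {"R": 0.5, "S": 0.5}},
                 "summer": {"R": {"R": 0.4, "S": 0.6}, "S": {"R": 0.1, "S": 0.9}}}

    def sample(dist):
        # Draw a key from a {state: probability} dict.
        r, acc = random.random(), 0.0
        for state, p in dist.items():
            acc += p
            if r < acc:
                return state
        return state  # guard against floating-point rounding

    regime, weather = "winter", "R"
    for day in range(10):
        regime = sample(regime_T[regime])             # outer (regime) transition
        weather = sample(weather_T[regime][weather])  # inner (weather) transition
        print(day, regime, weather)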


The Markov chains we normally hear of are first-order Markov chains, the general notion being d-order chains. These are hard to compute (too many model parameters), but there has been quite a bit of research on an interesting middle ground: variable-length Markov chains.


By definition, a Markov chain depends only on the current state. That said, I can think of a way to make something depend on the past as well: if you have two states (S for sunny, R for rainy), then with two days of state you now have four states, SS, SR, RS, RR, which gives you a 4x4 transition probability matrix. Of course you will have some restrictions; for example, from SS you can only go to SS or SR.
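A minimal sketch of that four-state encoding, with illustrative probabilities (the specific numbers are made up): each row has exactly two reachable successors, because the new pair has to overlap the old one.

    import numpy as np

    states = ["SS", "SR", "RS", "RR"]

    # Row = current two-day state, column = next two-day state. From XY you
    # can only reach YS or YR, so each row has two nonzero entries. The
    # probabilities below are illustrative, not fitted to anything.
    T = np.array([
        # SS    SR    RS    RR
        [0.80, 0.20, 0.00, 0.00],  # from SS
        [0.00, 0.00, 0.40, 0.60],  # from SR
        [0.60, 0.40, 0.00, 0.00],  # from RS
        [0.00, 0.00, 0.30, 0.70],  # from RR
    ])

    assert np.allclose(T.sum(axis=1), 1.0)  # each row is a probability distribution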


I think one of the defining requirements of Markov chains is that they be memoryless: states earlier than the current one cannot influence the transition probabilities.

It really depends on how finely you want to cut it.

You could make 10 rainy states, each with a higher probability of going sunny than the last, an equal chance to stay put or move to the next-less-rainy state, and no chance of moving back to a rainier state.

So R90 could be S = 10%, R90 = 45%, and R80 = 45%. R80 could be S = 20%, R80 = 40%, and R70 = 40%.

And so on.

That way you don't need to track how many days it's been raining; the chain naturally filters down to lower rainy percentages until it inevitably goes back to sunny.
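A minimal sketch of that decay scheme (the intermediate states are interpolated the same way as R90 and R80, which is an assumption; the comment only spells out those two):

    import random

    # States: "S" plus R90, R80, ..., R10. From R{p} the chain goes sunny with
    # probability (100 - p)%, otherwise it stays put or drops to the next rainy
    # state with equal probability. It never moves back up to a rainier state.
    def step(state):
        if state == "S":
            return "S"  # transitions out of sunny aren't specified in the comment
        p = int(state[1:])                  # e.g. "R90" -> 90
        if random.random() < (100 - p) / 100.0:
            return "S"
        lower = f"R{p - 10}" if p > 10 else state  # R10 is the floor
        return state if random.random() < 0.5 else lower

    state, rainy_days = "R90", 0
    while state != "S":
        rainy_days += 1
        state = step(state)
    print(f"rained for {rainy_days} days before turning sunny")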


You’d need some kind of more powerful model of computation.

One method for longer-term dependence is simply expanding the transition matrix and state space, but that’s not unbounded in any meaningful sense.

A recurrent neural network approximates unbounded long-term information for general tasks, but I think what you’re looking for in particular is a push-down automaton, while a Markov chain is a finite-state automaton.


See [0] for a way to make recurrent neural nets into pushdown nets.

[0]: https://arxiv.org/abs/1506.02516


I’m not an expert in NLP or these augmented models. Could you perhaps compare this to Dyer’s Stack LSTM?


I'm not sure how Dyer's thing works, but I can describe the paper I linked to:

A "stack" is a stack of (vector, scalar) pairs (a vector and its "weight"). At each time step, the stack has three inputs: the vector to push, the weight to push it with, and the weight to pop off the stack. Its output is the top 1.0 weight of the stack. The stack's behavior on a time step is divided in to three parts:

1. First, that pop weight is removed from the stack. For example, if you want to remove 0.6 from a stack like [(a, 0.4), (b, 0.4)], you'd remove all of a leaving you with 0.2 to remove from b, so the final stack would be [(b, 0.2)].

2. The pushed vector is placed on the stack with the given push weight.

3. The top 1.0 of the stack is blended together and returned. For example, a stack like [(a, 0.4), (b, 0.4), (c, 0.8)] would take 0.4 of a, 0.4 of b, and 0.2 of c.

This process is differentiable, and you can combine this with the usual recurrent loops to make a recurrent stack machine.
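A minimal sketch of that soft stack, with the stack stored top-first to match the examples above. The branching min() logic is written for readability; the paper expresses the same arithmetic with continuous operations so gradients can flow through it.

    import numpy as np

    class SoftStack:
        def __init__(self):
            self.pairs = []  # list of (vector, weight), top of stack first

        def step(self, push_vec, push_weight, pop_weight):
            # 1. Pop: remove pop_weight of strength, starting from the top.
            remaining, kept = pop_weight, []
            for vec, w in self.pairs:
                removed = min(w, remaining)
                remaining -= removed
                if w - removed > 0:
                    kept.append((vec, w - removed))
            self.pairs = kept
            # 2. Push: place the new vector on top with its weight.
            self.pairs.insert(0, (np.asarray(push_vec, dtype=float), push_weight))
            # 3. Read: blend together the top 1.0 of weight.
            read, budget = np.zeros_like(self.pairs[0][0]), 1.0
            for vec, w in self.pairs:
                take = min(w, budget)
                read += take * vec
                budget -= take
                if budget <= 0:
                    break
            return read

Popping 0.6 from a stack holding [(a, 0.4), (b, 0.4)] under this scheme leaves [(b, 0.2)], matching the example above.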


Thank you so much! That makes the paper a lot easier to understand.


The particular behavior you mention can be modeled by a semi-Markov model. In these, the time you stay in a given state can follow an arbitrary distribution. In a regular chain this dwell time is geometric, hence memoryless; but other distributions, with Gaussian-like or power-law tails, would produce the effect you want.
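A minimal sketch of that idea, assuming a two-state sunny/rainy chain and a Pareto dwell time purely to illustrate the power-law case (the comment doesn't pin down a distribution):

    import random

    # Semi-Markov sketch: instead of flipping a coin every day (which gives
    # geometric, hence memoryless, dwell times), draw how many days to stay in
    # a state from an explicit dwell-time distribution, then transition.
    def dwell_time(state):
        return int(random.paretovariate(1.5))  # >= 1 day, heavy upper tail

    def next_state(state):
        return "S" if state == "R" else "R"  # deterministic switch, for simplicity

    state, timeline = "S", []
    while len(timeline) < 30:
        timeline.extend([state] * dwell_time(state))
        state = next_state(state)
    print("".join(timeline[:30]))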



