This is a phenomenal example of how to teach math. You can go through theory, formulas and proofs all day long (yes, sometimes rigor is needed), but this type of teaching is sticky. It takes a lot more time to display information in the way the author has done, but it reaps massive benefits for those trying to learn. In my mind, math is about concepts and there is no reason why we can't start teaching math like this.
Amazing work. The collective intelligence of the world has just gone up.
I don't know. Is there really a massive benefit? The visual explanation of the Markov chains on that web site looks elegant and simple because... it is, i.e. it's not really a contribution of the visualization; it's because DTMCs are actually easy to understand. In a lecture, the teacher can easily simulate a Markov chain on a blackboard ("Now we are in this state, now we have to roll a die to decide whether we will go to state X or state Y"). Same for self-learners: In a book, you can tell the reader to start at some state X, roll a die, etc.
Honestly, I find that the Markov chain website is an example of how you should NOT do it. You are replacing the active intellectual process in the head of the student with trivial animations, scrolling text, and jumping tokens, which look nice and dynamic, but... what are you learning?
Of course, this doesn't mean that visualizations are useless, but they should be carefully designed with some learning goal in mind. Not easy! For example, the visualization could slowly gray out states that are not visited. In that way, the student could get a feeling for concepts like recurrence etc.
Totally disagree. ‘Same for self-learners: In a book, you can tell the reader to start at some state X, roll a die, etc.’ ->
This requires a learner to parse a procedure correctly AND understand how it relates to the topic. This is doubly hard on the student and doubly likely to force a mistake (in the procedure, the interpretation, or both). Learning with an approach like the one in the article approximately halves the work a learner must do, and increases the likelihood that the topic is correctly interpreted.
Retention with this method, however, is a whole other matter: one that does correlate with the amount of mental effort required, but how much so is a matter of debate.
I probably understand where you are coming from. But you may be missing the point?
> The visual explanation of the Markov chains on that web site looks elegant and simple because... it is
I strongly resonate with this statement. However, I disagree with the claim that "You are replacing the active intellectual process". In fact, it is the opposite: it is stimulating.
Let me present an anecdote which will perhaps resonate with many people. Often, when venturing into a new topic, one tends to read the reference texts. Let us be frank: reference texts are not beginner friendly, no matter how much they claim to be (they are references for a reason). To get to the meat of the story, a reader has to hold a lot of context from past knowledge. This is hard, and often, by the time they reach the actual results, the reader is already exhausted. Instead, such intuitive visualizations provide context to a new reader. The next time the reader goes back to the reference text, this "lookahead" (from the visualization) is valuable. It is easier to internalize where in the grand scheme of things a particular topic lies.
> Of course, this doesn't mean that visualizations are useless, but they should be carefully designed with some learning goal in mind. Not easy! For example, the visualization could slowly gray out states that are not visited. In that way, the student could get a feeling for concepts like recurrence etc.
Your intellect can be "activated" in the pursuit of superficial knowledge of Markov chains, enough to tell your boss about them or to remember in the future that they might help you with a problem. EZ-learning stuff will help you there.
Or your intellect can be "activated" in the sense that further engaging it leads you to deeper, more general knowledge. Is this website doing it? I don't believe so.
This disagreement demonstrates why professional teachers are so valuable.
Sometimes the material is important enough that you want to make it as easy to understand as possible. Sometimes the material is a chance to teach your students how to problem solve. Or how to teach themselves new material. Or to help them become mathematically literate (i.e., able to read a book or paper on their own without your help).
A good course/curriculum does both. Some definitions are carefully explained at an intuitive level so that students can struggle with things beyond the definition. Sometimes the definition is an opportunity to teach students how to go from a formal definition to an intuitive understanding. Or come up with extensions to the definition on their own. Or explore the implications of a definition on their own.
In short, are you teaching the definition of a Markov Chain as key content knowledge, or is this being presented as an exercise in mathematical creativity and mathematical problem solving? There is no "right" answer without more context; a good course will do the former with some topics and the latter with others.
So this is not a topic-level question. It's a course-level or curriculum-level question.
Interestingly, misunderstanding the fundamental tension between "explain well" and "teach people how to learn for themselves" is the source of many public debates on education. E.g., Common Core mathematics. Are we teaching students how to add numbers or how to problem solve? Well, ideally both... but the debates always become an either/or.
"For example, the visualization could slowly gray out states that are not visited. In that way, the student could get a feeling for concepts like recurrence etc."
You know, this is actually a good idea. As long as you didn't set any transition matrix elements to zero, the chain will have a stationary distribution that is given by the lead eigenvector of the transition matrix. If you gave each node a score when the chain passed through it, that eigenvector would manifest as the "score" accumulated by each node over a long period of time.
This is one of the cool, deep properties of the chain - and it's not "insight-free" in the same way you describe.
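To make that concrete, here's a quick numpy sketch (the 3-state transition matrix is invented for illustration) showing the simulated per-node "scores" lining up with the lead eigenvector:

    import numpy as np

    # Toy 3-state transition matrix (values invented): P[i, j] = P(i -> j).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.3, 0.5]])

    # Stationary distribution: the left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi /= pi.sum()

    # Give each node a "score" every time the chain passes through it.
    rng = np.random.default_rng(0)
    scores = np.zeros(3)
    state = 0
    for _ in range(100_000):
        scores[state] += 1
        state = rng.choice(3, p=P[state])

    print(pi, scores / scores.sum())  # the two should nearly match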
One of my favorite examples of this is the plinko game at the Museum of Math in NYC:
It's a physical manifestation of the central limit theorem, realized by a series of balls that can go either right or left in a random walk downwards. You can even adjust the bias on the left/right choice to supply drift to the random walk. The mass of balls that accumulates in the bottom is a histogram representation of the associated limiting Gaussian curve. Very cool.
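If you want to play with it at home, here's a rough simulation of that exhibit (rows, ball count, and bias all invented for illustration):

    import random

    # Each ball makes `rows` independent left/right choices;
    # `bias` > 0.5 adds drift to the right.
    rows, balls, bias = 12, 10_000, 0.5
    bins = [0] * (rows + 1)
    for _ in range(balls):
        rights = sum(1 for _ in range(rows) if random.random() < bias)
        bins[rights] += 1

    # Crude text histogram: the binomial counts approach a Gaussian
    # as `rows` grows -- the central limit theorem made visible.
    for i, count in enumerate(bins):
        print(f"{i:2d} {'#' * (count // 50)}")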
I don’t think I agree, but ultimately only feel comfortable speaking for myself.
I’m fairly new to programming so would be interested in hearing if someone else does this too...
On a walk this afternoon (for some reason), I realized I had a visual shape in my head for most design patterns. Some of them look a lot like a knot-tying diagram, for example. I’m not sure I had realized that before.
I also realized the shape representations are about as close to implementation as I would prefer to store design patterns in my mind. Beyond that, they might be limiting or, at worst, strenuous to use as guidelines. They’re surely more useful (to me) than UML or even code.
But the best part is I can tell they’ve been there all along. I wasn’t trying to make them up. I was just mapping out a project while on a walk. I was pretty relaxed at the time and maybe that’s why this came up.
For what they are, I was a little disappointed by how long it took to parse the Elements of Reusable OO Software book when I was just learning to code. And then when I had used the patterns and knew them, I found it a bit sad they were communicated through such rigid forms.
Art actually contends with the same sort of conceptual divisions. There’s the medium which is rigid and cumbersome, the form which is more pure, critical in the abstract but medium-agnostic, and the application, being the interface or the thing out in the world.
So I do agree the ability to transpose these concepts between form and implementation is what matters, but it’s the back and forth that matters, not where one begins.
I both agree and disagree. I've always found an intuitive explanation like the OP to be helpful before diving into a more rigorous study of the material. It helps contextualize the things you learn subsequently.
3Blue1Brown is impressive in how it uses visualizations, cute faces, soothing voiceovers, etc. but keeps pulling you down the rabbit hole. So at the end of the video you know that you don't know stuff you either think you knew or didn't know existed. And then you start looking up Wikipedia and just getting lost in the music.
This is really really rare. Education as entertainment means that the goal is to get the consumer to feel good about himself, to feel like he learned something. You're getting paid (in money or kudos or whatever) to engage the consumer's narcissism, not his curiosity.
> ...keeps pulling you down the rabbit hole. So at the end of the video you know that you don't know stuff you either think you knew or didn't know existed. And then you start looking up Wikipedia and just getting lost in the music.
> Education as entertainment means that the goal is to get the consumer to feel good about himself, to feel like he learned something. You're getting paid (in money or kudos or whatever) to engage the consumer's narcissism, not his curiosity.
This is a profound observation. I’m not sure it only applies in entertainment.
I’ve heard it put other ways, but your version, if applied to education as a whole, might be a pretty big deal. There’s no shortage of evidence, either, from Ivy League campuses to niche interest subcultures.
Maybe “narcissism” isn’t the best word, but some sort of similar tension related to confidence seems to be associated with successful educational environments. We commonly attribute it to a one-way causal relationship, but you’re articulating the opposite. Likewise, we are used to acknowledging that ‘not giving up’, or ‘belief in one’s self’ is a necessity, but again you’re making a separate claim.
You’re saying, in so many words, that confidence in and of itself defines the will as well as the curiosity. And I can’t argue with it. I want to, but I’m not sure I can.
I feel like the key thing about a "growth oriented mindset", or some near-equivalent buzzword, is that the normal state of affairs is to not know things. This is the foundation (in Socrates) of our civilization.
What I've been trying to articulate here and there over the past few weeks is that the object of intellectual development should be acquiring structured ways of not knowing. Rigorous mathematics is such a way: in a basic analysis class you get red marks for thinking you know things you can't prove. The Buddhism of the Heart Sutra is another: the Heart Sutra boils down to "we want to know big truths about emptiness and suffering to transcend it and achieve Nirvana, but because of emptiness there's no way to know and nothing to know; therefore, recite these magic words, because that's what wise men do". (It sounds like a scam, but it's really deep in how the argument is actually made at length.)
This of course runs counter to beliefs in "ultratechnology", such that we'll soon craft our own transcendence. But eh, we shall fight in the shade.
This seems like quite an overreaction. This animation could easily confuse people who don't know as much as you do. The little ball moving back and forth -- what is that? What is doing the moving? What determines the speed of the ball?
These are the questions you'll be running into when you show this to middle and high-school students who have no idea what state space is and what matrix operations are. Like most math visualizations, this one seems best for people who are already somewhat familiar with the material.
I think it's a cool animation, but I'm almost sure I'm suffering from a hindsight bias. My physics professor used to say that nothing is worse than an analogy taken too literally.
I agree. On the other hand, it lowers the mental effort required to study the topic, and that will probably not make every student more clever; it may make them lazier, just waiting for the next YouTube video to be told new stuff about maths instead of going off to research and study on their own.
The thing is that 80% of the usefulness of learning is that it's learning practice. This is why you sit through so much crap in K-12: you're learning how to learn. This rings even more true for techies, who are always chasing a moving target (nonono, object-oriented CORBA RPC, nono, it's REST, it's kernel SVMs, it's convolutional networks...). The effort is (almost) the whole point.
If you don't keep your learning organs stimulated they will literally wither and shrink.
I agree with you, but I don't think we should make things unnecessarily hard to understand if they aren't so. Do you? There will always be difficult things to learn out there, but I don't think we should artificially make all things difficult if we don't have to.
> we should make things unnecessarily hard to understand if they aren't so
Who said otherwise? I am just saying that if you, as a teacher, are concerned about making people/students more clever, you need to think both about vulgarization and about the meta level, where they need to actually make a mental effort to incorporate ideas, maths, methodology, calculus, visualization capabilities, and, in general, self-motivation.
If you consider the brain like a muscle, then "leaving mental effort for elsewhere" doesn't really make sense. If anything, it's making it more resilient for future work.
This really drives home a sentiment I've acquired during the course of my college education: CS/math is often considered "hard", but I feel that's just because we've struggled with getting good visual/verbal communicators to dedicate their lives to CS/math education. I really feel that when explained properly (and the definition of "properly" sometimes needs to be adapted from person to person), topics like Gibbs sampling or Fourier transforms or backpropagation aren't topics that should take entire weeks of self-study to grasp in 2018. Yes, they require some math background, but there's some strong intuition behind them. Maybe I'm just slow or thick in the skull.
Does anyone know the probabilities of HN moving between states “animated Markov chains article at the top” and “animated Markov chains article not at the top”?
There is definitely memory, but HN also has new members.
If repeat topics move up to a top location, this probably reflects the number of people who hadn't yet seen the topic. "Not new, but new to you."
Even without new members, not everyone is on all the time, so some people miss stuff like this. Or, even if you've seen it before, you vote it up because you like it and want others to see it.
And Hidden Markov Models (HMMs) are just ones where the symbols are emitted upon interaction with an edge of the graph instead of the node. The nodes are still the "states" of the system, but in general you can't know which state the model is in just by observing the symbols emitted by the edges as there may be equivalent structures in the graph associated with different states.
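A toy sketch of that edge-emitting formulation (states, symbols, and probabilities all invented): note that seeing a "y" below tells you nothing definite about the state, since both the A -> B and B -> A edges emit it.

    import random

    # Each state maps to a list of outgoing edges: (next_state, symbol, prob).
    edges = {
        "A": [("A", "x", 0.5), ("B", "y", 0.5)],
        "B": [("A", "y", 0.7), ("B", "x", 0.3)],
    }

    def emit(start, steps):
        state, out = start, []
        for _ in range(steps):
            r, acc = random.random(), 0.0
            for nxt, sym, p in edges[state]:
                acc += p
                if r < acc:
                    out.append(sym)   # the symbol comes from the edge,
                    state = nxt       # not from the state itself
                    break
        return "".join(out)

    print(emit("A", 20))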
Really great visualization here, kudos to the author. It really makes it clear what's going on. It made me think that things like this could be useful in textbooks as well, if they were more digital. Maybe in the next few years there will be a better way to integrate this kind of stuff into courses; I guess today they could always happen in lectures.
Take a look at distill.pub (which was also featured on the HN frontpage today). It's essentially a machine learning journal with a focus on clarity and visualization, and is therefore interactive and visual. Colah is around on HN and might comment explaining it better.
Is there a standard way to build Markov chains with deeper memory than just the current state? For example, in the rainy/sunny Markov chain, our rainy probability should/could decrease as the number of rainy days in a row increases. In a pure state machine this would require an infinite number of states for SS...SSS and RR...RRR.
Lots of other commenters are saying that the defining property of MCs is that they are memoryless, but in fact it's trivial to make an MC have deeper memory: use tuples.
So if you have observations A, B, C, B, C, A, D, A, ... then you form pairs: (A,B), (B,C), (C,B), (B,C), (C,A), (A,D), (D,A), ...
Each node in the MC is one of these pairs, and (X,Y) -> (Y,Z) if Z is the new state that appears.
The number of states grows very quickly, but if you make the system dynamic then it works fine. This is what you use to randomly generate text using the previous 3 words as a predictor for the next. Then (w0,w1,w2) predicts w3, and your next "state" is (w1,w2,w3).
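A minimal sketch of that word-tuple trick, assuming a toy corpus and k = 3 (everything here invented for illustration):

    import random
    from collections import defaultdict

    k = 3
    words = "the cat sat on the mat and then the cat sat on the sofa".split()

    # State = tuple of the previous k words; value = observed next words.
    chain = defaultdict(list)
    for i in range(len(words) - k):
        chain[tuple(words[i:i + k])].append(words[i + k])

    # Generate: slide the k-word window forward one word at a time.
    state = tuple(words[:k])
    out = list(state)
    for _ in range(10):
        nexts = chain.get(state)
        if not nexts:
            break
        nxt = random.choice(nexts)
        out.append(nxt)
        state = state[1:] + (nxt,)
    print(" ".join(out))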
As a confused undergrad first learning about random dynamical systems on my own (outside of any classes), this really bothered me. How could the Markov assumption be so powerful when it's clearly trivial to add memory and retain the Markov property?!
After reading through theorems and proofs for dynamical systems theory, and especially considering applications of that theory, I came up with basically two practical take-aways:
1. Complexity matters and the curse of dimensionality is real.
2. All those nice theorems (stability, convergence, splitting, invariance, bifurcations, ...) are stated with respect to a state space and make assumptions about the dynamics on that state space. So if you code up a state space representation that is history-aware with respect to the dynamics you really care about, then:
a) you might lose some theorems that hold for the original process, and
b) you sometimes realize the theorems lose a lot of power when they're telling you something about a tuple of things you care about instead of the actual thing you care about.
If I ever teach people about random dynamical systems, this explanation will come immediately after the definition of a Markov process and I'll continually remind my audience throughout the course. Understanding what goes wrong when a system is "technically" Markovian is probably the best way to obtain a deep intuition. Or at least, it was very helpful to me.
Isn't this like an "auto-regressive feed forward neural net"[0]?
> Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts y_t using only the k most recent inputs, x_{t−k+1}, …, x_t. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the k most recent inputs. Google’s WaveNet[1] nicely illustrates this general principle.
Is this equivalent to each node representing a memory state?
Seems like it would not solve rtkwe's concern. In their example, would using tuples make a node's state equal to (S, COUNT_OF_DAYS) or (R, COUNT_OF_DAYS)? You would still need an infinite number of nodes to represent an infinite-precision integer state.
You don't store the count of days; each node is the sequence of the previous k states. Usually k=1, but set k=2 and you get something where each state is predicted based on the previous 2 states.
So no, you don't need an infinite amount of memory; it's all still finite (although it rapidly gets large).
The definition of the Markov property is that the process is memoryless: the probability distribution of the next state depends only on the current state. This makes Markov chains an easy tool to use for mathematical analysis, but limits the range of things they can model completely.
One simple way to add memory is to make the states more complex. For example, instead of recording the weather each day as a state, you could record it as two-day tuples. This allows more complex dynamics, but the number of states quickly increases.
Hierarchical Markov models are another way to achieve this. In a hierarchical model, there is a probability matrix for each type of weather regime (winter vs. summer) and a probability matrix that governs transitions between these regimes.
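A rough regime-switching sketch of that idea (all probabilities invented): a slow top-level chain over seasons, each season carrying its own daily weather matrix.

    import random

    season_T = {"summer": {"summer": 0.99, "winter": 0.01},
                "winter": {"winter": 0.99, "summer": 0.01}}
    weather_T = {
        "summer": {"S": {"S": 0.9, "R": 0.1}, "R": {"S": 0.7, "R": 0.3}},
        "winter": {"S": {"S": 0.5, "R": 0.5}, "R": {"S": 0.2, "R": 0.8}},
    }

    def step(dist):
        # Sample a key from a {state: prob} dict.
        r, acc = random.random(), 0.0
        for state, p in dist.items():
            acc += p
            if r < acc:
                return state
        return state  # guard against float rounding

    season, weather = "summer", "S"
    for day in range(5):
        season = step(season_T[season])
        weather = step(weather_T[season][weather])
        print(day, season, weather)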
The Markov chains we normally hear of are first-order Markov chains, the general notion being d-order chains. These are hard to compute (too many model parameters), but there has been quite some research on an interesting middle ground: variable-length Markov chains.
By definition, a Markov chain only depends on the current state. That said, I can think of a way to make something depend on the past as well: if you have two states (S for sunny, R for rainy), then with two days of state you now have four states: SS, SR, RS, RR, which gives you a transition probability matrix. Of course you will have some restrictions; for example, from SS you can only go to SS or SR.
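For instance (probabilities invented), labeling each state as (yesterday, today):

             SS    SR    RS    RR
       SS   0.8   0.2    0     0
       SR    0     0    0.4   0.6
       RS   0.7   0.3    0     0
       RR    0     0    0.3   0.7

Each row sums to 1, and the zeros encode the restriction: from (X, Y) you can only move to a state of the form (Y, Z).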
I think one of the features/requirements of Markov chains is that they be memoryless: previous states cannot influence the current transition probabilities, only the current state can.
It really depends on how granular you want to cut it.
You could make 10 rainy states, each with an increased probability of sunny, equal chance to stay or go to a lower probability, and no chance of going to a state with a higher probability.
So R90 could be S = 10%, R90 = 45%, and R80 = 45%.
R80 could be S = 20%, R80 = 40%, and R70 = 40%.
And so on.
That way you don't need to track how many days it's been raining, it will naturally filter down to lower rainy percentages until it inevitably goes back to sunny.
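A quick sketch of that cascade (only three rungs, and the R70 row is a hypothetical bottom rung that returns the leftover probability to sunny):

    import random

    # (next_state, prob) pairs per state, following the example above.
    T = {"R90": [("S", 0.10), ("R90", 0.45), ("R80", 0.45)],
         "R80": [("S", 0.20), ("R80", 0.40), ("R70", 0.40)],
         "R70": [("S", 0.60), ("R70", 0.40)]}   # hypothetical last rung

    def rainy_streak(state="R90"):
        days = 0
        while state != "S":
            days += 1
            r, acc = random.random(), 0.0
            for nxt, p in T[state]:
                acc += p
                if r < acc:
                    state = nxt
                    break
        return days

    # The chain filters down the rungs and inevitably reaches sunny.
    runs = [rainy_streak() for _ in range(10_000)]
    print(sum(runs) / len(runs))  # average rainy streak length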
You’d need some kind of higher-order structure of computation.
One method for longer-term dependence is simply expanding the transition matrix and state space, but that’s not unbounded in any meaningful sense.
A recurrent neural network approximates unbounded long-term information for general tasks, but I think what you’re looking for in particular is a push-down automaton, while a Markov chain is a finite-state automaton.
I'm not sure how Dyer's thing works, but I can describe the paper I linked to:
A "stack" is a stack of (vector, scalar) pairs (a vector and its "weight"). At each time step, the stack has three inputs: the vector to push, the weight to push it with, and the weight to pop off the stack. Its output is the top 1.0 weight of the stack. The stack's behavior on a time step is divided in to three parts:
1. First, that pop weight is removed from the stack. For example, if you want to remove 0.6 from a stack like [(a, 0.4), (b, 0.4)], you'd remove all of a leaving you with 0.2 to remove from b, so the final stack would be [(b, 0.2)].
2. The pushed vector is placed on the stack with the given push weight.
3. The top 1.0 of the stack is blended together and returned. For example, a stack like [(a, 0.4), (b, 0.4), (c, 0.8)] would take 0.4 of a, 0.4 of b, and 0.2 of c.
This process is differentiable, and you can combine this with the usual recurrent loops to make a recurrent stack machine.
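Here's a toy, non-differentiable Python version of those three steps (scalar stand-ins for the vectors; all names invented):

    def stack_step(stack, push_val, push_weight, pop_weight):
        """stack: list of (value, weight) pairs, top of stack first."""
        # 1. Pop `pop_weight` of mass off the top.
        remaining = pop_weight
        while stack and remaining > 0:
            val, w = stack[0]
            if w <= remaining:
                stack.pop(0)
                remaining -= w
            else:
                stack[0] = (val, w - remaining)
                remaining = 0
        # 2. Push the new value with its weight.
        if push_weight > 0:
            stack.insert(0, (push_val, push_weight))
        # 3. Read: blend together the top 1.0 of weight.
        out, budget = 0.0, 1.0
        for val, w in stack:
            take = min(w, budget)
            out += take * val
            budget -= take
            if budget <= 0:
                break
        return out

    # The example above: popping 0.6 from [(a, 0.4), (b, 0.4)] leaves [(b, 0.2)].
    s = [(1.0, 0.4), (2.0, 0.4)]        # a = 1.0, b = 2.0 as stand-ins
    stack_step(s, push_val=0.0, push_weight=0.0, pop_weight=0.6)
    print(s)                             # [(2.0, 0.2)]

The real model does the same thing with vectors, with the push/pop weights produced by the surrounding recurrent network, which is what keeps the whole loop differentiable.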
The particular behavior you mention can be modeled by a semi-Markov model. In these, the time you stay in a given state can follow an arbitrary fixed distribution. In a regular chain this time is geometric, hence memoryless. But other distributions, with Gaussian-like tails or power-law tails, would produce the effect you want.
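A sketch of that dwell-time difference (distributions chosen arbitrarily): unlike the geometric case, a non-geometric dwell time makes the chance of leaving depend on how long you've already been in the state.

    import random

    # Regular chain: dwell time in a state is geometric (memoryless).
    def geometric_dwell(p_leave=0.3):
        days = 1
        while random.random() > p_leave:
            days += 1
        return days

    # Semi-Markov: plug in any dwell distribution instead, e.g. a
    # power-law one; the exit probability now depends on elapsed time.
    def powerlaw_dwell(alpha=1.5):
        return int(random.paretovariate(alpha))

    print(geometric_dwell(), powerlaw_dwell())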
This is how we were taught it in our Linear 2 class; the professor used static pictures via chalkboard, but it was still super helpful. I feel like most of the benefit is gained in drawing the graph and talking about "hopping between states". Animating it might help some people who think more mechanically.
Isn't this affected by the fact that it's difficult to get true randomness from computers? It's more pseudo-random, and therefore the chance is not real chance but predictable.
While true, there are cryptographically secure PRNGs that are incredibly difficult (practically impossible) to predict.
In any case, unless you're using Markov chains to generate data that needs to be kept secure (like generating a pronounceable password), the predictability of the random numbers isn't really relevant.
> Amazing work. The collective intelligence of the world has just gone up.
Another amazing example of visual math lectures: https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw