The problem with many explanations is that the people who write them have forgotten what it is like not to know Bayes' theorem. Here is my humble attempt at a starting point for people who want to grok it:
Imagine you have been locked inside a room. You are asked what the weather outside is. Assume you have to pick one of sunny, cloudy, and rainy. You know it is July, so you predict it is most likely sunny. (Because intuitively you know that the odds of "sunny" are higher than those of the other two.) [1]
Now you see that someone has entered the room carrying an umbrella. You are asked that question again. Of course you are going to update your answer based on the new information. Your previous odds calculation, while good, has to accommodate the umbrella factor -- a very solid piece of additional information. [2]
Before the umbrella appeared, all you knew was [1]; this is called the Prior.
After you saw the umbrella, the odds are conditioned on the umbrella; this is called the Posterior [2], which is what you want. And you should still use your intuition about the weather [1], which now goes to the right-hand side of the equation.
So
Posterior = ( ) x Prior
The blank is called the Likelihood: it is the probability of the umbrella given the weather.
Imagine you live in a place where umbrellas are used only during rain. Then the likelihood of an umbrella given that it is sunny is very low, which makes the product Prior x Likelihood very small.
Whereas the likelihood of an umbrella given that it is rainy is close to certain. Even if your initial guess (the prior) gives low odds to rain, the product likelihood x prior for rain comes out higher than the product in the previous paragraph. You answer "rainy".
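If it helps, here is the same update as a tiny Python sketch; the priors and umbrella likelihoods are made-up numbers, purely for illustration:

    # Made-up numbers: priors for a July day and P(umbrella | weather).
    prior = {"sunny": 0.7, "cloudy": 0.2, "rainy": 0.1}
    likelihood = {"sunny": 0.05, "cloudy": 0.20, "rainy": 0.95}

    unnormalized = {w: likelihood[w] * prior[w] for w in prior}
    total = sum(unnormalized.values())
    posterior = {w: round(p / total, 3) for w, p in unnormalized.items()}

    print(posterior)  # {'sunny': 0.206, 'cloudy': 0.235, 'rainy': 0.559}

Rainy wins despite its small prior, because the umbrella likelihood dominates.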
Of course there is more to it and I suggest reading other great explanations in this thread.
Still a pretty long winded explanation, though a good one.
I like to think of Bayes' Theorem as akin to scaling "universes". For example, consider trying to determine the probability of someone over 6 feet tall being a woman. First you take the probability of a woman being over 6 feet tall (say 1%) and then you multiply that by the probability of being a woman (say, 50%). Now you have the probability of being a woman AND being over six feet tall scaled to the overall size of the total "universe" (being a human being). Then, if you divide this by the probability of being over 6 feet tall (say, 4%) you'll get the probability of being a woman AND being over six feet tall scaled to the universe where everyone is over six feet tall (12.5%).
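A quick sketch of that arithmetic in Python, using the example numbers above:

    # Example numbers from the comment above.
    p_tall_given_woman = 0.01  # P(over 6ft | woman)
    p_woman = 0.50             # P(woman)
    p_tall = 0.04              # P(over 6ft)

    # Scale to the whole "universe" of humans...
    p_tall_and_woman = p_tall_given_woman * p_woman  # 0.005
    # ...then rescale to the universe where everyone is over 6ft.
    p_woman_given_tall = p_tall_and_woman / p_tall   # 0.125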
Agreed. I didn't really get Bayes' Theorem intuitively until I saw the pictures in this short blog post which emphasizes the idea you pointed out of 'scaling "universes"':
I actually got a really good question when I was using set diagrams to explain ideas from logic.
So, I was explaining some logic ideas to a guy on IRC, and he was struggling with the contrapositive rule, which says "if A implies B, then not-B implies not-A." He asked me what this looked like as a set diagram.
I began to teach him how "A implies B" means that whenever you have A, you know you have B. The picture is that B can be bigger, but the situations with A are contained within it, so the picture looks like this: "A is a subset of B." (In symbols, A → B becomes A ⊆ B.)
Now you need to understand that "not A" is the entire region outside the circle of A, and "not B" is the entire region outside the circle of B. And then you have to understand a crazy perspective: that the entire space outside B is now a subset of the entire space outside A.
He was very confused about this, so I explained it this way: There is a story about a physicist, an engineer, a mathematician, and a farmer. The farmer asks all three for a fence containing the largest space for his sheep.
The engineer is first up, and builds a square fence using one of the walls of the barn to get a little extra space fenced in. The farmer seems pretty satisfied with this, so they both go to meet the physicist, who has tethered a cord to a peg and is drawing a large circle. "Circles," he points out, "minimize the perimeter-to-area ratio. Actually, I could probably do the same with your barn there, get you a little more space by chopping a chord out of the circle." The farmer says, "no, this looks like it will take too long."
They both come over to see the mathematician, who has apparently gotten tangled up in the fence! They start to work to get him out of there -- the farmer asks, "what were you thinking, why did you bend the fence this way, what is wrong with you?!" The mathematician says, "you don't understand -- this is the outside of this fence!"
Suddenly, the idea of flipping "inside" and "outside" seemed to make sense, and I was able to show him that yes, if you take this perspective, the corresponding rule is Bᶜ ⊆ Aᶜ, thus not-B → not-A.
(Another strategy which often works is to leverage moral intuitions, but note that 'permissions' and 'implications' are opposite arrows. A → B means "if I know A, then I also know B." Apply this to "Santa only gives presents to good children". All your intuitions will work much better, save one: you would probably write that statement as "good child → gets presents", following what is permissible for good children, but in logic it actually states "got present → good child": if Santa gave a present, then you know the child was good, but Santa might not give a good child a present, especially if that child is, say, Jewish.)
Simpler: probability is about finding the relative size of a shape.
When you want to know A|B, that means you want to know what fraction of B is taken up by the intersection A-AND-B.
Well, say you happen to know B|A, or how much A-AND-B takes up inside of A. And you also know the size of A relative to the entire page, and B relative to the same.
Then, you can compute the "absolute" size of A-AND-B by multiplying the fraction of A that A-AND-B takes up by the fraction of the page that A takes up.
Now that you know the "absolute" size of A-AND-B, you can freely compare it to lots of other shapes. In this case, you can find A|B as simply as computing A-AND-B/B.
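A small sketch of this, with arbitrary example areas:

    # Arbitrary example areas, all relative to the page.
    p_A = 0.30          # size of A
    p_B = 0.25          # size of B
    p_B_given_A = 0.40  # fraction of A covered by A-AND-B

    p_A_and_B = p_B_given_A * p_A  # "absolute" size of A-AND-B: 0.12
    p_A_given_B = p_A_and_B / p_B  # fraction of B it covers: 0.48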
This is identical to your explanation, except I find it helpful to focus less on chance and expectation and to think more about geometry and sets of outcomes.
Note that my notion of "absolute" size is still just relative. You could measure size relative to any other shape, but "size relative to the entire page" just happens to be a measure that we often know.
That's a good intuition, but I think the author avoided that for some important foundational reasons.
There are two main ways of developing probability theory.
The first way is to use measure theory. This means that probability is literally defined as an area or volume, and it leads to the picture you describe. However, it forces you to think of probability as picking from a set of possibilities, and it also means you can arbitrarily carve up a shape or jam infinitely many of them together. You also need to be quite careful, as you may be taking the area of some rather weird shapes.
The second way is to assign "plausibilities" to various propositions, and then use them following a few common-sense rules. This assumes a lot less, but Richard Cox proved it gives the same result as the previous method when they both apply. However, you can now talk about the probability of rain without needing to be able to decompose it into statements about the trajectory of clouds. This also lets you talk about the probability the sun will rise tomorrow or the probability I am typing this from a hot tub without thinking about possible worlds or alternate histories.
The author is a huge proponent of the latter approach, and thus likely tried to avoid talking about shapes or regions.
In high school, you probably learned the classical definition of probability: if I say that something happens with probability A/B, it means that if I repeat an experiment infinitely often, it happens on average A out of every B times. This turns out to be too fragile. I can make a die where the probability of rolling an odd number is 1/2 and the probability of rolling a number below 3 is 1/3, yet the probability of rolling a 1 is undefined!
I know that there's some subtlety to correctly getting meaning out of probability, and I vaguely remember having similar doubts about my approach because of this.
That said, this simple model didn't fail to explain any of the concepts I learned in elementary discrete and continuous probability. It was especially useful for understanding the latter.
I don't have a lot of brain-RAM, and I find that I get confused the second that I stop focusing on "sets of outcomes". Also, for me, attempts to use "common-sense rules" about expectation seem to fail quickest of all.
I like to think of Bayes' rule in terms of hypotheses and predictions. Take the hypotheses to be "is a woman" and "is not a woman"/"is a man", and the data to be "is over 6 feet tall". Our priors on the two hypotheses are roughly equal (50%). They predict the data with probability 1% (6' given a woman) and 7% (6' given a man). [Hence the total probability of 6' is 1% x 50% + 7% x 50% = 4%.]
Now, our Woman hypothesis predicts the data more poorly than the Man hypothesis, so we need to update our confidence in the two hypotheses accordingly. Bayes' rule tells us the update should be in proportion to how well each predicted the data. Originally the Woman:Man odds were 1:1, so now they must be 1:7 (or 12.5% for Woman).
Think of a column plot with hypotheses along the horizontal axis, where the width of each column is proportional to its prior and the height is proportional to the strength of its prediction of the data. If you then "squash down" the columns, preserving their areas but making all heights equal, the new widths give the posterior probabilities.
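A minimal sketch of that squashing, reusing the numbers above:

    # Width = prior, height = P(data | hypothesis).
    priors = {"woman": 0.5, "man": 0.5}
    p_tall = {"woman": 0.01, "man": 0.07}

    areas = {h: priors[h] * p_tall[h] for h in priors}    # area = width * height
    total = sum(areas.values())                           # P(6ft) = 0.04
    posterior = {h: a / total for h, a in areas.items()}  # squash to equal height

    print(posterior)  # {'woman': 0.125, 'man': 0.875}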
more like a verbose explanation of Bayes' Theorem...
I'll give my succinct spin: Throw a coin in the air 10 times. Say 6 times the coin lands heads up, and 4 times tails up.
A Bayesian would say: the most likely explanation is that the coin is biased in favor of heads, with Prob(heads) = .6 and Prob(tails) = .4, because given such a model, the chance of this outcome is about 25% ((10 choose 6) x .6^6 x .4^4). Less likely is the explanation that the coin is fair.
A Frequentist would begin by assuming a model, likely that the coin is fair. She would then argue that the probability of getting 6 heads in 10 tosses is about 21% ((10 choose 6) x 1/2^10), and she would have tools (number of standard deviations from the mean, etc.) to test how much of an outlier this observation is, in order to validate her model.
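For concreteness, a quick sketch of the likelihood arithmetic behind both paragraphs:

    from math import comb

    # Chance of 6 heads in 10 flips under a given P(heads).
    def likelihood(p, heads=6, flips=10):
        return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

    print(likelihood(0.6))  # ~0.251, the biased model from the Bayesian paragraph
    print(likelihood(0.5))  # ~0.205, the fair coin from the Frequentist paragraph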
In summary, the Bayesian considers many models at once, and asks how likely are they given the observations. A Frequentist starts with one model, and asks how likely is the observation.
The Frequentist uses the scientific method: develop a hypothesis, then test it. The Bayesian does something else, which I think is better, but ill-defined. She somehow tests all hypotheses at once, and puts different confidences in each.
This is not exactly correct. Both methods you mention here are frequentist. The first is called maximum likelihood and the second is hypothesis testing.
What a Bayesian does is not ill-defined. On the contrary, what a Bayesian does is two steps: (1) determine a prior, (2) follow the math mechanically. What frequentists do when they do inference is actually much more ill-defined, as it requires hidden assumptions intermixed with a mathematical derivation instead of cleanly separating assumptions from math. In principle, the Bayesian way is: given assumptions, simply follow the axioms of probability to their logical conclusions. Of course that may not always be computationally feasible, so approximations sometimes have to be made. Frequentists require divine inspiration to develop a new method for each kind of problem.
The bayesian way to solve the problem would be to take a prior probability distribution on the probability of heads, and then condition that on the data that the coin ended up on heads 6 times and on tails 4 times. For example with a uniform prior, we get this posterior: http://www.wolframalpha.com/input/?i=1*p%5E6%281-p%29%5E4%2F...
As you can see, the peak is indeed at p = 0.6, but there is probability mass around it as well.
Replace the factors of 1 with your favorite prior to see the effect (I couldn't convince wolfram alpha to compute a beta distribution in place of the 1, so you'll have to try that yourself).
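Since WolframAlpha is being stubborn, here is one way to do the same computation locally; the grid normalization is just my sketch, not the only approach:

    import numpy as np

    # Uniform prior times binomial likelihood for 6 heads, 4 tails,
    # normalized on a grid (the closed form is a Beta(7, 5) distribution).
    # Swap in your favorite prior on the `unnormalized` line.
    p = np.linspace(0, 1, 1001)
    unnormalized = p**6 * (1 - p)**4
    posterior = unnormalized / np.trapz(unnormalized, p)

    print(p[np.argmax(posterior)])  # peak at p = 0.6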
The example is not quite right. The Bayesian starts with a prior, some weighting of likelihoods. In the case of a coin a reasonable prior would place a lot of initial probability on the coin being fair or almost fair. You'd need to flip a biased coin more times in order to overcome the prior and make 'biased' the dominant hypothesis.
That is correct. I tried to allude to that at the end, when I said how a Bayesian mysteriously gives a certain confidence to the whole space of models. In the case of a uniform prior, I believe my example is correct.
Usually when Bayesian probability is explained, it often comes across as "duh, that's what anyone would do" except not rigorously with equations. I tried to give an example contrasting a Frequentist and Bayesian approach.
The best explanation I found of the theorem was from the book by Bertsekas and Tsitsiklis:
-> You have a resulting event B in front of you. Any of the causes {A1,...,An} (all of them mutually disjoint) could have led to B. Now, you know how likely each of {A1,...,An} is to produce B (this is P(B|Ai)). Bayes' theorem allows you to use this information to deduce which Ai is most likely to have been the cause, given that B occurred (P(Ai|B)).
i.e., you use Bayes' theorem to reverse the conditional probability relationships given in the problem. The exact expression is then easily derivable from this idea and the total probability theorem.
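A small sketch of that recipe, with made-up priors and conditional probabilities:

    # Made-up numbers: three disjoint causes with priors P(Ai)
    # and known conditional probabilities P(B | Ai).
    priors = [0.5, 0.3, 0.2]
    p_b_given_a = [0.1, 0.4, 0.8]

    # The total probability theorem gives P(B)...
    p_b = sum(pa * pb for pa, pb in zip(priors, p_b_given_a))  # 0.33
    # ...and Bayes reverses the conditioning.
    posteriors = [pa * pb / p_b for pa, pb in zip(priors, p_b_given_a)]

    print(posteriors)  # A3 is the likeliest cause despite its small prior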
None of these explanations are intuitive. Here is a visual explanation in two short paragraphs. Imagine a Venn diagram of two partly overlapping circles. The probability space is the sheet of paper on which the diagram is drawn. The area of the yellow circle relative to the entire sheet is the probability of event Y. The same measure for the cyan circle is the probability of event C. The overlapping area is green (yellow and cyan pigments, when mixed, make green) and corresponds to both events occurring.
To simplify calculations, we don't scale the circle areas by the area of the entire sheet (which incidentally turns our probabilities into frequencies). This way, P(C) and P(Y) are simply the areas of the corresponding circles. Now, P(Y|C) is the probability of event Y occurring, conditional on event C also occurring. It is visualized as simply as taking the green area (because both events have to happen) and dividing it by the area of the entire circle C (because we are conditioning on C); the resulting ratio is P(Y|C). Now it should be clear as day that P(Y|C) * P(C) = green area = P(C|Y) * P(Y), which is just another way to write Bayes' theorem.
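A quick numeric check of the picture, with made-up areas and the sheet normalized to 1:

    # Made-up areas, with the sheet normalized to area 1.
    area_Y, area_C, green = 0.30, 0.20, 0.08  # green = overlap of Y and C

    p_Y_given_C = green / area_C  # 0.4
    p_C_given_Y = green / area_Y  # ~0.267

    # Both products recover the green area -- Bayes' theorem:
    assert abs(p_Y_given_C * area_C - p_C_given_Y * area_Y) < 1e-12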
This explains the equation, but it doesn't explain the "philosophy." Being Bayesian is in direct conflict with the "scientific method." The scientific method says: make a theory, then test it against observation. To a Bayesian, first comes the observation, and they pick the theory that best explains it.
If T = theory, and O = observation
P(T|O) = P(O|T) * P(T) / P(O)
tells you: given my observation, what is the probability of my theory, P(T|O)? And the equation tells you how this quantity depends on how well your theory predicts the observation, P(O|T).
Not true. Maximum likelihood picks the hypothesis/theory that best "explains" the data, hence you run into issues with overfitting, etc. A Bayesian starts with a prior over a set of hypotheses (the P(T) term), and then uses the data to update their confidence in the various hypotheses. Assuming you have a sane prior, you end up with a simple theory with a record of making decent predictions (i.e. the kind of thing you look for in science, cf. Occam's razor, etc).
I don't see any conflict with the scientific method (at least in a fundamental sense). Scientists aren't oracles - hypotheses need to come from somewhere. Some call it intuition - I would call it an implicit application of Bayesian reasoning, where the data is experience, and the prior is governed by genetic constraints of the brain. From this you obtain a set of intuitive hypotheses to be (further) tested (i.e. those with large posterior P(T|O)).
Testing a set of hypotheses then just involves collecting observations that differentiate them (i.e. where they make conflicting predictions). This can still be considered an application of Bayes' rule, but usually one tries to collect enough data that it's quite obvious which is consistently making the best predictions (in other words has posterior close to 1), in which case it becomes a theory.
I never disagreed with the equation (of course it's correct). My point is that the prior always comes first, even in UII. You're not simply picking the hypothesis that best explains the data (assuming by best explains you mean has the greatest likelihood P(O|T)), otherwise you just end up with the hypothesis containing a lookup table of all previous data. You need to take into account your confidence in the hypothesis before the data arrived (e.g. based on the complexity/size of programs expressing that hypothesis for UII).
Ah. Your use of "not true" made it look like an outright dismissal of his whole statement. As for when to pick the prior, I think what matters more is that the data not influence your choice of prior. If you were some oracular machine, you could see the data, then generate hypotheses and priors for them independently of the data, and still not fall into the problem you state.
And then there is the problem of how you form sensible hypotheses without at least knowing the shape of the data first. The forms of these hypotheses are themselves a restriction on the possible space. I think that is what the GGP was getting at.
Well, the OP was about a theorem, so I explained the theorem... Bayes' theorem formalizes the process through which human intuition works and by which we pick theories. It is hugely ironic that people not familiar with Bayes' theorem often consider it unintuitive, since the theorem itself is the basic process of human intuition, formalized.
Now, the whole point of the classic scientific method, developed before the scientific mainstream even knew about Bayes' theorem, is to "fight" human intuition, since it (probably rightfully so) considers human intuition useful for making a guess but not really useful (and often in the way) when one needs to confirm a theory. To summarize, the scientific method was developed to "fight" Bayesianism before Bayesianism existed. Hence people writing long articles and philosophizing about what should be a trivial issue.
But Bayes' theorem is not obvious to humans. The naive intuition is that P(A|B) = P(B|A); see the base rate fallacy, etc.
That said, Bayesianism is not the same thing as Bayes' theorem. You can be a frequentist and still apply Bayes' theorem and the like. Bayes' theorem is straightforward once you train your mind on it.
Bayesianism is much more than that, and it is not a trivial issue. But you are also correct that human philosophical intuition about probability is Bayesian, albeit with a very broken application. Few people who claim to be Bayesians are actually doing pure Bayesian probability.
Hand wringing over priors and model structure is what Bayesianism is all about. Philosophically, the viewpoints of bayesianism and how to best pick priors are interesting. Also interesting is how Quantum Mechanics fits neatly into the bayesian perspective.
I have a feeling that Bayes is unintuitive specifically because people pick extreme or otherwise bad examples to explain it. As you mention, the way we calculate odds is oddly (pardon the pun) similar to Bayes.
Here is an African safari example. The probability of an individual hunter being attacked by a lion, P(L|H) (assuming any lion who tracks down a hunter will also attack him), equals the conditional probability of the hunter tracking down a lion, P(H|L) (a measure of how good the hunter is at finding lions, assuming there are lions), times the general probability of encountering a lion in the bush, P(L), divided by the general probability of encountering another hunter like himself in the bush, P(H). Hence you can get away with hunting lions alone only if you're very bad at finding lions or if there are no lions.
> The scientific method says: make a theory, then test it against observation.
That's... not the scientific method I learned. The one I learned starts with "Observe a phenomenon". Of course, then you develop a hypothesis and formalize the observation process so that conclusions can be more reliably drawn from it (an experiment), but you're still observing something first.
At the risk of being self-congratulatory, I submit an independently arrived-at first conception of Bayesian thinking.
I arrived at a Bayesian solution to my own problem one night, independent of ever learning Bayes. My undergrad was in Physics, but I don't recall covering Bayes before that night. The first time I heard about Bayes, that I recall, was in pg's bio, which had to be after Reddit was founded.
Anyway, much like tonight, I awoke in the middle of the night realizing I could calculate my odds of getting into medical school.