Basic Bayesian reasoning: a better way to think (Part 1)
Image: by me. Feel free to use, just link back to this post. |
Bayesian inference is the mathematical extension of propositional logic using probability theory. It is superior to deductive propositional logic, which is what many people think of when they hear the word "logic". In fact it includes the rules of propositional logic as special cases of its more powerful and general rules. It is the logical framework that underlies the scientific method, and it encompasses a great deal of what it means to be a rational, logical, scientific individual. As with "normal", propositional logic, you don't necessarily have to be formally trained to use it in your daily life, but knowing its basics will greatly clarify your thoughts and sharpen your rational thinking skills. The intent of this post is to provide an introduction to this important topic.
Let's study an easy problem in propositional logic as a prerequisite review and a starting point for Bayesian logic. You should have learned in middle or high school that if A implies B, and B implies C, then A implies C. With some symbols, it becomes "if (A → B) and (B → C), then (A → C)". In an example with words, it might look like "If Socrates was a human, and all human are mortals, then Socrates was mortal". This is well and good. This is a fine way of thinking. Learning how to think this way is useful and worth learning.
However, when we examine the world around us, this rule is severely restricted in its applicability. Consider the following: "If Socrates was a human, and all humans have ten fingers, then Socrates had ten fingers". Is this sound? Can we conclude that Socrates necessarily had ten fingers? Well, no. The second premise - "all humans have ten fingers" - is not strictly true. Certainly most humans do, but not all. So we cannot conclude that Socrates had ten fingers. For that matter, we're not completely 100% sure that Socrates was human either.
"What's wrong with that?" You ask. "Hasn't logic brought us to a correct conclusion, that Socrates might not have had ten fingers?" True. But that's a very weak conclusion. Someone who was basing an argument on the possibility that Socrates didn't have ten fingers would need some additional evidence. I mean, until now I had implicitly assumed that Socrates had ten fingers, and I don't think I was being particularly irrational. Isn't there some way to conclude that "Socrates probably had ten fingers"? Maybe with a rule like "If A is likely to lead to B, and B is likely to lead to C, then A is likely to lead to C"? Doesn't that seem like a pretty logical conclusion?
Of course, "If A is likely to lead to B, and B is likely to lead to C, then A is likely to lead to C" is not a valid argument in propositional logic, and you can certainly find examples where A is true while C is not. For instance, "A blind person is likely to be a poor driver. Poor drivers are likely to get traffic tickets. Therefore, a blind person is likely to get traffic tickets" seems to be a incorrect chain of reasoning. But how could we be sure that it's not just an instance of bad luck, that this is just one of the cases where that probabilistic statement, "likely", just didn't pan out?
So we can't come to any firm conclusions about the "likely" rule in logic, although it seems to make sense sometimes. At any rate we can't use rigid propositional logic with such statements. But this is an enormous restriction, because there is nothing we know in the physical world with absolute certainty. Every instrument of measurement - including your own eyes and hands - are subject to errors and uncertainty. Even if you double check and verify, that only reduces the uncertainty to infinitesimal levels, without ever completely eliminating it. How can we reason in such cases - that is, in any real world scenarios where we are perpetually plagued by uncertainty?
In Bayesian reasoning, these uncertainties are built into its foundations. The truth of a statement is not represented by just "true" and "false", but by a continuous numerical probability value between 0 and 1. So, for instance, the statement "It will rain tomorrow" might get a probability value of 0.1, representing a 10% chance of rain. A statement like "I will still be alive tomorrow" might get a value like 0.999999, as I will almost certainly not die today. "1" and "0" would respectively correspond to absolute certainty in the truth or falsehood of a statement, but as I said they cannot be used in statements about the physical world. Instead we use numbers like 0.5 to represent the certainty that the coin will land heads, or 0.65 to represent the certainty you feel that you're going to marry that girl.
But isn't any given statement ultimately either true or false? Perhaps, but we are not God. We're ignorant of many things. But we still need to reason, even in our uncertainties. Giving a numerical, probabilistic truth value to a statement allows Bayesian reasoning to mirror the human mind much more closely than propositional logic. In essence, you can treat the numerical value you give to a statement as your personal degree of subjective certainty that the statement is true, given the information that you have.
But isn't this all very probabilistic, subjective, and uncertain? In one sense, yes. And that is a strength of Bayesian reasoning, because that's an actual limitation of the human mind. In representing the truth in this way we are only accurately representing how the truth actually exists in our minds. If we actually cannot be certain, then it's appropriate that our logical system actually represents that uncertainty.
But in another sense, this probabilistic thinking is completely rigorous and unyielding. By assigning a probabilistic value to the truth, you can use all of the mathematical tools of probability theory to process them, and their conclusions are mathematically certain. Bayesian reasoning makes very definitive statements about what these probability values must be, and how they must change in light of new evidence. I said earlier that Bayesian reasoning is an extension of logic using math, and that is exactly as rigorous and compelling as it sounds.
To finish this post, let me give you an extended example to illustrate both the rigor and flexibility of Bayesian reasoning, and its superseding superiority over propositional logic. We will address the question about Socrates and his fingers. There will be some math ahead, but nothing you can't understand at a high school level.
First, let me introduce some notations:
P(X) is the probability that you assign to statement X being true. So, if statement X is "I will roll a 1 on this dice", you might assign P(X)=1/6. But if you happened to know that the dice was loaded, you might assign P(X)=1/2 instead.
P(X|Y) is the probably that X is true, given that Y is already known to be true. So, if X is "It will rain tomorrow" and Y is "It will be cloudy tomorrow", then P(X) might only be 0.1, whereas P(X|Y) would be larger, perhaps something like 0.3. That is, there is only a 10% chance of rain tomorrow, but if we know that it will be cloudy, the chance of rain increases to 30%. Notice that the probabilities change depending on what additional relevant information is known.
This P(X|Y) notation is a little bit awkward, as it's written backwards from the more intuitive "if Y, then X" way of thinking. But unfortunately it's the standard notation. By definition, P(X|Y) = P(XY)/P(Y), where P(XY) is the probability that both X and Y are true.
~X is the negation of X. It is the statement that "X is false". By the rules of probability, P(~X)+P(X)=1, because X must be either true or false. Likewise, P(~X|Y)+P(X|Y)=1, and P(~XY)+P(XY)=P(Y)
You can translate a statement in propositional logic into a statement in Bayesian, probabilistic representation, simply by setting certain probabilities to 1 or 0. For instance, "X implies Y", which would be written as "X → Y" in propositional logic, would be written as "P(Y|X)=1" in terms of probabilities.
Now back to Socrates. Let the relevant statements be represented as follows:
A: "This person is Socrates"
B: "This person is a Human"
C: "This person has ten fingers"
"Socrates was Human": P(B|A)
"Humans have ten fingers": P(C|B)
Now, let's show that we can duplicate the results of propositional logic simply by setting the probabilities to 1. If P(B|A) = P(C|B) = 1, then by the definitions given earlier, P(BA)=P(A), P(CB)=P(B), therefore P(~BA)=P(~CB)=0, therefore P(C~BA)=P(~CBA)=P(~C~BA)=0. But P(C|A) = [ P(CBA)+P(C~BA) ] / [ P(CBA)+P(C~BA)+P(~CBA)+P(~C~BA) ], which reduces to P(CBA)/P(CBA) =1 after eliminating all the zero terms. That is to say, if P(B|A) = P(C|B) = 1, then P(C|A) = 1. Or, translating back into words, "If Socrates is human, and humans have ten fingers, then Socrates has ten fingers".
Don't worry too much if you got lost in the notation in the above paragraph. The important point is that Bayesian reasoning can reduce down to propositional logic for the special cases where the probability values are set to 1 or 0. Bayesian reasoning thereby completely encompasses and supersedes propositional logic, like General relativity supersedes Newtonian gravity.
What if the probabilities are not 100%? This is the real-life problem of dealing with uncertainties. What is the actual value of P(B|A), the probability that Socrates was human? Might he not have been an alien, or an angel? As ridiculous as these possibilities seem, they ruin our complete certainty and makes propositional logic flounder. What about P(C|B) - the probability that a human has ten fingers? It's certainly not 100%. And what can we conclude about P(C|A) - the probability that Socrates had ten fingers?
To tackle this question, we need to consider the following formula for P(C|A), which can be derived from straightforward application of the rules and definitions mentioned earlier. The fact that this formula exists - that we can actually derive it and use it to perform exact calculations - is one of the compelling fruits of the Bayesian way of thinking. Here it is:
P(C|A) = P(C|BA)P(B|A) + P(C|~BA)P(~B|A)
Let's say that Socrates has a P(B|A)=0.999 999 chance of being human, and that given all this, he has a P(C|BA)=0.998 chance of having ten fingers. This means that P(~B|A) = 0.000 001 is the chance that Socrates was not human. The last factor we need to know, P(C|~BA), is the probability that a non-human Socrates had ten fingers. This is nearly impossible to estimate, as we'd have to consider all the different things Socrates could have been - alien, angel, a demon in disguise, etc. But it will turn out not to matter much for our final result. Let's just assign P(C|~BA)=0.1. Plugging in the numbers and calculating, we get that P(C|A) = 0.997999102. That is to say, Socrates almost certainly had ten fingers.
If you want extra practice, you can also try using the formula on the "blind man getting traffic tickets" scenario above, and see why a blind man is not likely to get traffic tickets. Or you can wait until next week's post to get the answer.
But again, don't worry too much about the details of numerical calculation. The important point is that Bayesian reasoning provides an exact formula for calculating the probability of a conclusion, even when the premises were also only probabilities - which is always the case in the physical universe. Furthermore, the conclusions drawn this way are compelling, because they are mathematical results. If you accept the probabilities of the premises, then you must accept the conclusion. This is the same compelling force which is at work in propositional reasoning. The premises lead to inescapable conclusions.
I hope that this example demonstrates to you the usefulness of the Bayesian way of reasoning. It can be actually applied to situations with uncertain premises, which is really nearly all situations. It is completely rigorous in that a correct Bayesian argument forces you to accept its conclusions if you accept its premises. Yet it's also flexible in assigning probabilities to reflect your current, subjective, personal degree of belief in the truthfulness of a statement. It duplicates propositional logic as its special cases, and in its full form it's more general and more powerful than propositional logic. There are other advantages I have not yet touched on, such as its ability to naturally explain inductive reasoning and Occam's razor, and how it serves as the framework for the scientific method. On the whole, it encompasses a great deal of what it means to be a logical, rational, and scientific thinker.
In my next post, I will discuss a particularly important formula in this probabilistic way of thinking, one that is nearly synonymous with Bayesian reasoning - Bayes' theorem.
Basic Bayesian reasoning: a better way to think (Part 2)
Image: portrait of Thomas Bayes, public domain |
Essentially, we can now use any valid equation in probability theory as a rule of logic. We saw an useful example in the last post: P(C|A) = P(C|BA)P(B|A) + P(C|~BA)P(~B|A). This captures the intuitive idea that if A is likely to lead to B, and B is likely to lead to C, then A is likely to lead to C. But it also does more - it tells us precisely how to calculate the probability for our conclusion, while simultaneously sharpening, guiding, and correcting our thinking. In particular, it tells us that in the second step, it's not enough that B is likely to lead to C, instead requiring that BA is likely to lead to C. (Incidentally, this is why a blind person is not likely to get traffic tickets, even though a blind person is likely to be a bad driver, and bad drivers are likely to get tickets.)
In this post, I will introduce another such equation in probability theory:
P(A|B) = P(B|A)/P(B) * P(A)
This is Bayes' Theorem. Named after Reverend Thomas Bayes, this unassuming little equation, which can be derived immediately from the definition that P(A|B) = P(AB)/P(B), is so important that its use is nearly synonymous with Bayesian logic, and its interpretation is the logical basis for the scientific method. At its heart, this equation tells you how to update your beliefs based on the evidence. To see how that works, set A = "hypothesis", and B = "observation" in the formula. The equation then becomes:
P(hypothesis|observation) = P(observation|hypothesis)/P(observation) * P(hypothesis)
Each factor can then be translated into words as:
P(hypothesis): probability that the hypothesis is true, before considering the observation.
P(hypothesis|observation): probability that the hypothesis is true, after considering the observation.
P(observation|hypothesis): probability for the observation, as predicted by the hypothesis.
P(observation): probability for the observation, averaged over the predictions from every hypothesis.
This equation tells us how we should update our opinion on a hypothesis after we make a relevant observation. That is, it tells us how to go from P(hypothesis) to P(hypothesis|observation). It says that a hypothesis becomes more likely to be true if it's able to predict an observation better than the "average" hypothesis: the bigger the ratio of P(observation|hypothesis)/P(observation), the more likely the hypothesis becomes. Conversely it becomes less likely to be true if it could not beat the "average" hypothesis in its predictions. In short, it says that an observation counts as evidence for the hypothesis that better predicted it. We already intuitively knew this to be true - but Bayes' theorem states it in a mathematically rigorous fashion, and allows us to put firm numbers to some of these factors.
Let's consider an example: Alice and Bob go out on a date. Bob liked Alice and wants to ask her to a second date, but he's not sure how she'll respond. So he hypothesizes two possible outcomes: Alice will say "yes" to a second date, or she will say "no". Based on all the information he has - how Alice acted before and during the date, how they communicated afterwards, etc. - he thinks that there's a 50-50 chance between Alice saying "yes" or "no". That is to say:
P(Alice will say "yes") = P(Alice will say "no") = 0.5
For the sake of simplicity, we will not consider other possibilities, such as Alice saying some form of "maybe". These two "yes" and "no" will serve as our complete set of possible hypotheses.
While Bob is agonizing over this second date, he runs into Carol, who is a mutual friend to both Alice and Bob. She tells Bob, "Alice absolutely loved it last night! She can't wait to go out with you again!". Carol's affirmation serves as evidence that Alice will say "yes" to a second date. We already knew this intuitively: Carol's affirmation is obviously good news for Bob. But Bayes' theorem allows us to calculate the probability explicitly from some starting probabilities. To see this, we need to evaluate two probability values: P(Carol's affirmation|Alice will say "yes"), and P(Carol's affirmation|Alice will say "no").
What value should we assign to P(Carol's affirmation|Alice will say "yes")? That is, if Alice would say "yes" to a second date, what is the probability that Carol would have given Bob her affirmation? Not particularly high - After all, Carol could have simply forgotten to mention Alice's reaction, or Alice and Carol might not have had a chance to discuss the first date, or Alice could have had a terrible time, but she might still give Bob a second chance. All these are ways that the "yes" hypothesis might not lead to Carol's affirmation. Taking these things into account, let's say that P(Carol's affirmation|Alice will say "yes") = 0.2.
What about P(Carol's affirmation|Alice will say "no")? This is the probability that Carol would still communicate her affirmation to Bob, even though Alice would say "no" to a second date. Now, it could be that Alice hated her first date with Bob, but Carol deliberately lied to him. Or maybe Carol simply wanted to encourage Bob even though she didn't really know how Alice felt. Or Alice did really enjoy her time with Bob, but she'll be suddenly struck by amnesia before Bob asks her out again. But assuming that Alice and Carol are honest people, and that nothing particularly strange happens, it's very unlikely that Carol gives Bob her affirmation if Alice is going to say "no". So let's say that P(Carol's affirmation|Alice will say "no") = 0.02
Now, what about P(Carol's affirmation)? This is the last factor we need to apply Bayes' theorem. This is the probability that Carol gives Bob her affirmation, averaged over both the "yes" and "no" hypotheses. Since there's a 50-50 chance that Alice will say "yes" or "no", this is simply the average of the two probabilities mentioned above: 0.5*0.2 + 0.5*0.02 = 0.11. This step can get complicated, but because of the 50-50 chance for our two hypotheses, it is mercifully short in this simple example. So P(Carol's affirmation)=0.11.
This now gives Bob enough information to compute P(Alice will say "yes"|Carol's affirmation). That is, given that Carol told Bob that Alice wants to go out again, what is the probability that Alice will answer "yes" to a second date? According to Bayes' theorem:
P(Alice will say "yes"|Carol's affirmation) =
P(Carol's affirmation|Alice will say "yes")/P(Carol's affirmation) * P(Alice will say "yes") =
0.2/0.11 * 0.5 = 0.909090... = 10/11
Carol's affirmation, upon considering it as evidence, has pulled the probability from 50% to 91%. That is, if Bob thought before that there was only a 50% chance that Alice will agree to a second date, he should now think that there is a 91% chance. That is what evidence does: it pulls the probability for a hypothesis in one direction or another. A strong piece of evidence might pull it all the way from 0.1% to 99.9%, whereas a weak piece of evidence might only pull it from 50% to 60%. An opposing piece of evidence will pull the probability in the other direction, as in a tug-of-war. This is why we commonly speak of "weighing" the evidence. This exemplifies how Bayesian reasoning corresponds to the common sense we use in everyday life, except that it's mathematically precise.
Note that Bayes' theorem, as with all Bayesian reasoning, compels you to accept its conclusions: you cannot simply say "I don't buy this argument" or "I don't find this convincing". If you accept its premises, you must accept it conclusion: otherwise you're violating the rules of mathematical logic.
But where do the premises come from? How did I assign, for example, that P(Alice will say "yes")=0.5, or P(Carol's affirmation|Alice will say "no")=0.02? Well, for the sake of this problem, I just made up some reasonable values. In real life, computing these values would be far more difficult than the example problem itself. For instance, to calculate P(Alice will say "yes"), Bob would have to consider all the relevant background information he has. This would include how Alice interacted with him during the first date, his knowledge of human mating behaviors (it's a good sign if she laughs at your jokes, it's bad if she calls the cops on you, etc), and any other relevant information. Based on this total information, he would calculate how often a woman like Alice would agree to a second date, and that would be his P(Alice will say "yes"). That's why this probability can be thought of as a personal, subjective degree of belief: because nobody else has the exact set of background information that Bob has.
What about calculating P(Carol's affirmation|Alice will say "no")? This is the probability that Carol will convey Alice's approval to Bob, even though Alice will say "no" to the second date. This number might be obtained through some sociological studies, by asking questions like "How often do women tell their friends that they enjoyed a date even if they didn't?" The nature of the relationship between Alice, Bob, and Carol also needs to be taken into account, along with their personalities. Are Alice and Carol very close friends? Is Carol generally reliable, or is she prone to hyperbole? The value of P(Carol's affirmation|Alice will say "no") is all this information condensed into a single number.
You may be disappointed that these probabilities are not simple to calculate. This is often the case in real-life scenarios. It turns out that humans and human relationships cannot be reduced down to a simple calculation, even with Bayes' theorem. Real life is complicated: this should not surprise anyone. Often, the relevant starting probabilities can only be guessed at from intuition. Being able to do that well is a large part of what it means to be a reasonable, logical person in the real world.
So those are the strengths and weaknesses of Bayes' theorem. On the one hand, it provides a firm, computationally exact way of updating your beliefs based on the evidence. On the other hand, the probabilities needed to perform the calculations can be difficult or impossible to assign. In particular, the assignment of prior probabilities - The degree of belief in the hypothesis before considering the observation - is a famously contentious issue within Bayesian reasoning, and there is no way to assign these numbers that's been established as being correct. This was the value of P(Alice will say "yes") in our example above, and I have described it as a personal, subjective probability based on the unique set of background information that a person has. This gives us a ballpark number that we can immediately use, but its imprecise and subjective nature, combined with the human capacity for self-deception, is a cause for concern.
Is Bayesian reasoning still useful in light of these weaknesses? Definitely. It is still far more applicable than propositional logic, and it still tells us, in a mathematically precise way, how to logically incorporate evidence into your beliefs. And often, when there is enough evidence, the specific values of these questionable probabilities turn out to be irrelevant. This is why we often look for overwhelming evidence, beyond any reasonable doubt, before we decide to take action based on a hypothesis. So while Bayes' theorem cannot tell us everything (nothing in this world can), it is a very useful tool for sharpening our thinking and processing evidence to update our beliefs.
In my next post, I will re-cast Bayes' theorem into a different mathematical equation - the odds form - which eases some of the difficulties of Bayesian reasoning. I will use this new form to discuss more examples in Bayesian reasoning.
Basic Bayesian reasoning: a better way to think (Part 3)
In my last post, I introduced Bayes' theorem:
P(hypothesis|observation) = P(observation|hypothesis)/P(observation) * P(hypothesis)
Now, this is a powerful equation that tells us how to use observed evidence to update our beliefs about a hypothesis. But as I mentioned, it has two difficulties with its use: first, the probability prior to the observation - P(hypothesis) - is famously difficult to compute in a clear, objective manner, and it changes based on the background information that each person has. For these reasons it's often said to be a personal, subjective probability, reflecting a particular person's degree of belief based on his or her unique set of background information.
And second, things get even worse for P(observation): this is the probability of making the observation, averaged over the complete set of competing hypotheses. Because this is an average over the complete set, we have to know all P(hypothesis) values for every competing hypothesis. But as we said just in the previous paragraph, computing even one of these values is difficult. If that wasn't hard enough, in real-life situations we may not even be able to enumerate the complete set of competing hypotheses. And then, even if we somehow got through all these difficulties, we still have to calculate P(observation|hypothesis) values for each of these hypotheses, which itself is no trivial task, then calculate their average across all the hypotheses. This step often requires more computation than the rest of Bayes' theorem put together, even for well-defined problems with fixed values for all other probabilities.
For these reasons I often like to use Bayes' theorem in odds form: simply write down the equations for two different hypotheses and divide one by the other, and you get:
P(hypothesis A|observation)/P(hypothesis B|observation) =
P(hypothesis A)/P(hypothesis B) * P(observation|hypothesis A)/P(observation|hypothesis B)
This can be summarized as "posterior odds = prior odds * likelihood ratio (of the observation being made from each hypothesis)", where:
P(hypothesis A|observation)/P(hypothesis B|observation) = posterior odds,
P(hypothesis A)/P(hypothesis B) = prior odds,
P(observation|hypothesis A)/P(observation|hypothesis B) = likelihood ratio.
Let's go through an example: say you're investigating a murder. You think that Alice is twice as likely to be guilty compared to Bob - this is your prior odds. You then observe fingerprints on the murder weapon that are 3000 times more likely to have come from Alice than from Bob - this is the likelihood ratio. You multiply these ratios to calculate your new opinion, the posterior odds: Alice is now 6000 times more likely to be guilty than Bob. Posterior odds is prior odds times likelihood ratio.
This is still Bayes' theorem, just in a different algebraic form. The intuition captured by this equation is the same: an observations counts as evidence towards the hypothesis that better predicts, anticipates, explains, or agrees with that observation. But notice that in this form, P(observation) - which was difficult or impossible to calculate - has been cancelled out. Also, P(hypothesis) - another troublesome number - only appears in a ratio of two competing hypotheses, which I think is a more reasonable way to think of it: it's easier to say how much more likely one hypothesis is than another, instead of assigning absolute probabilities to both of them. In short, this form makes the math easier, and allows you to think of just two hypotheses at a time, rather than having to account for the complete set of competing hypotheses all at once. You don't have to worry about Carol and her fingerprints for the time being in the above murder investigation example.
Let's go through a couple more examples:
Say that your friend claims that he has a trick coin: he says it lands "heads" all the time, rather than the 50% of the time that you'd normally expect. You're somewhat skeptical, and based on his general trustworthiness and the previous similar claims he's made, you only think that there's a 1:4 odds that this is a 100% "heads" coin, versus it being a normal coin. This is your P(always heads)/P(normal), the prior odds.
When you express your skepticism, your friend says, "well then, let me just show you!" and flips the coin. It lands "heads". "See!" says your friend. "I told you it'll always lands heads!" Now, obviously a single flip doesn't prove anything. But it certainly is evidence - not very strong evidence, but some evidence. Since the coin will land "heads" 100% of the time if your friend is right, but only 50% of the time if it's a normal coin, their ratio - the likelihood ratio - is 100%:50%, or 2:1.
Now, according to the odds form of Bayes' theorem, posterior odds is prior odds times likelihood ratio. 1:4 * 2:1 = 1:2, so you should now believe that there's a 1:2 odds that this is a trick coin like your friend claimed, versus it being a normal coin. You're still skeptical of the claim, but you're now less skeptical.
Noting your remaining skepticism, your friend then flips the coin again. "Ha, another heads!" he says as he calls out the result. Now, to calculate your new opinion, simply repeat the calculation above, with the previous answer - the old posterior odds of 1:2 - serving as the new prior odds. The likelihood ratio remains 2:1. Posterior odds is prior odds times likelihood ratio, so our new posterior odds is 1:2*2:1 = 1:1. You should now be completely uncertain as to whether this coin in fact is a trick coin. You say to your friend, "well, you may have something there".
"Okay, fine then." says your friend. "Let's flip this thing ten more times." And behold, it comes up "heads" all ten times. Your posterior odds get multiplied by 2:1 for each of the ten flips, and it's now 1:1 * (2:1)^10 = 1024:1. You should now believe that the chance of this being an "always heads" coin is 1024 times greater than it being a normal coin. If you're willing to consider "normal" and "always heads" as the complete set of competing hypotheses, this would give you over 99.9% certainty that your friend is right that this coin will always land heads.
"Wow, amazing." you tell your friend, as you're now pretty much convinced. "I've never actually seen one of these before", you say, as you idly grab the coin and flip it again, fully expecting it to land "heads" once more. But this time, it lands "tails".
What now? The likelihood ratio for the coin to land "tails" - P(tails|always heads)/P(tails|normal) - is 0%:50%, or 0:1. Our new posterior odds is 1024:1 * 0:1 = 0:1. There is now absolutely no chance that this coin is one that will land heads 100% of the time. But at the same time, it also seems unlikely that it's just a normal coin. given that it landed "heads" 12 times in a row just before this. A new possibility suggests itself: that this coin has something like a 90% chance of landing heads.
This illustrates one of the major advantages of the odds form of Bayes' theorem. Before this, you hadn't even considered that the chance for this coin to land "heads" was anything other than 50% or 100%. All of the other hypotheses - such as the coin landing "heads" 90% or 80% or 20% of the time - you had ignored. And yet, even without considering the complete set of competing hypotheses, you were still able to carry out valid calculations and make statistical inferences, reaching sound conclusions.
You both stare at the coin that landed "tails". You ask your friend, "What just happened?" He replies, "well, the magician I bought it from said that it would always land heads. And it seemed to be working fine up 'til now. Maybe he just meant that it'll land heads most of the time?" Being naturally suspicious, you respond, "Looks like he lied to you then. He probably just sold you a normal coin". But your friend comes back with, "C'mon, you know that's not fair. Human language doesn't work like that. It's imprecise by its very nature. When someone says 'always' in casual conversation, they don't necessarily mean '100.000000...% of the time' with an infinite number of significant figures. Even 'normal' coins don't land heads exactly 50.000000...% of the time". Struck by your friend's rare moment of lucid articulation, you become temporarily speechless. "Besides", your friend continues, "the magician might have said that the coin 'nearly always lands heads'. I don't remember exactly".
With this new insight, you realize that your had set your priors to the wrong hypotheses at the beginning of the problem. Instead of the hypotheses that the coin to land "heads" exactly 100% of the time, or exactly 50% of the time, you should have set them to 'close to 100% of the time' and 'close to 50% of the time'. Giving the odds of P(close to 100%)/P(close to 50%) = 1:4 as before, and interpreting "close to" as a flat distribution within 2% of the given value, We get that the likelihood ratio for the coin landing "heads" is P(heads|close to 100)/P(heads|normal) = 99%:50% = 1.98:1, and for the coin landing "tails" is P(tails|close to 100)/P(tails|normal) = 1%:50% = 0.02:1. Then the value for the posterior odds after 12 heads and 1 tails is given by prior odds times likelihood ratio, and it is roughly:
1:4 * (1.98:1)^12 * 0.02:1 = 18.15:1
(This is an approximation, made by assuming that the probability distribution can be thought of as being entirely focused at the center of their interval. The actual value, 16.97:1, can be obtained by a straightforward integration over the probability distributions, but that calculation lies beyond the scope of this introductory post.)
So you don't have to abandon the "close to 100%" hypothesis along with the "exactly 100% hypothesis. The odds are still 18:1 in favor of the coin landing "heads" more than 98% of the time, against it being a "normal" coin - enough for you to be reasonably confident in believing as your friend does.
This illustrates again the advantages of using the odds form. Firstly, we again didn't have to consider other probability values for the coin landing "heads", such as 75%. We were still able to come to a reasonable conclusion without having to specify the complete set of competing hypotheses, and their probability distribution. Secondly, we were able to completely switch the class of hypotheses under consideration, without losing consistency. If we had stuck to the original form of Bayes' theorem, then we would have had to specify our prior probabilities for P(heads exactly 100% of the time) and P(heads exactly 50% of the time). To maintain our 1:4 ratio, we would assign them as 20% and 80%, taking up all 100% of our probability, because we were not thinking about other possibilities. But then, upon realizing our mistake, we would have no choice but to contradict our previous priors, and assign P(heads close to 100% of the time) and P(close to 50% of the time) some values, while going back and admitting that the chances of the coin giving exactly 50% or 100% "heads" are nearly zero. This is a problem created entirely by being unaware of the complete set of competing hypotheses.
But with the odds form, we don't have to have complete awareness. All the conclusions that we came to are still perfectly consistent with the data: there is zero chance for the coin to land "heads" exactly 100% of the time, yet it is much more likely that the "heads" probability is close to 100% than it being a normal coin. Our two sets of priors do not contradict each other either: it's quite reasonable for our prior odds to be 1:4 in both cases, because we have not specified how much of the total probability they take up. In general, I feel that it's easier to say how likely two hypotheses are relative to one another, rather than specifying the absolute probability value for a hypothesis.
I hope this convinces you of the virtues of the odds form of Bayes' theorem. This is how I use Bayes' theorem in everyday situations to sharpen my thinking: I didn't know if this one movie was going to be any good (prior odds), but upon its recommendation from a friend (likelihood ratio), I revise my opinion and are now more likely to see it (posterior odds). I didn't know whether Argentina or Germany is more likely to win the World Cup (prior odds), but upon watching Germany slaughter Brazil (likelihood ratio), I now consider Germany more likely than Argentina to win the World Cup (posterior odds). So on and so forth. Posterior odds is prior odds times likelihood ratio.
Let's consider a couple of last examples:
I don't know if Bill Gates owns Fort Knox (prior odds). But I know that he's rich, and he's more likely to be rich if the owns Fort Knox than if he does not (likelihood ratio). Therefore, given that Bill Gates is rich, he's more likely to own Fort Knox (posterior odds).
Does that reasoning sound suspicious? It should. I took it straight from the Wikipedia page on "affirming the consequent", which is a logical fallacy. But the structure of the above argument is correct according to Bayes' theorem. It follows the same structure as all of my other examples. So, has Bayesian reasoning lead to a logical fallacy? Oh no! What shall we do?
Hold that thought, while we consider our last example:
I don't know whether Einstein's theory of general relativity, or Newton's theory of gravity is correct. (prior odds). But upon considering the experimental evidence of bending of starlight observed during the 1919 solar eclipse (likelihood ratio), I now consider general relativity much more likely to be correct than Newtonian gravity (posterior odds).
You should recognize that as the event that actually "proved" general relativity to the public, and the epitome of the scientific method at work: hypotheses are judged according to their agreement with experimental observations. But this is nothing more than just straightforward Bayesian reasoning, following the same structure as all of my other examples. So, it turns out that Bayesian reasoning underlies the scientific method, by providing the logical framework for it.
What are we to make of these two last examples? Does Bayesian reasoning allow for affirming the consequent? But isn't that a logical fallacy? But doesn't Bayesian reasoning also underlie the scientific method? Does that mean that science follows a logically flawed system? What are we to make of this?
I will address these issues in my next post.
P(hypothesis|observation) = P(observation|hypothesis)/P(observation) * P(hypothesis)
Now, this is a powerful equation that tells us how to use observed evidence to update our beliefs about a hypothesis. But as I mentioned, it has two difficulties with its use: first, the probability prior to the observation - P(hypothesis) - is famously difficult to compute in a clear, objective manner, and it changes based on the background information that each person has. For these reasons it's often said to be a personal, subjective probability, reflecting a particular person's degree of belief based on his or her unique set of background information.
And second, things get even worse for P(observation): this is the probability of making the observation, averaged over the complete set of competing hypotheses. Because this is an average over the complete set, we have to know all P(hypothesis) values for every competing hypothesis. But as we said just in the previous paragraph, computing even one of these values is difficult. If that wasn't hard enough, in real-life situations we may not even be able to enumerate the complete set of competing hypotheses. And then, even if we somehow got through all these difficulties, we still have to calculate P(observation|hypothesis) values for each of these hypotheses, which itself is no trivial task, then calculate their average across all the hypotheses. This step often requires more computation than the rest of Bayes' theorem put together, even for well-defined problems with fixed values for all other probabilities.
For these reasons I often like to use Bayes' theorem in odds form: simply write down the equations for two different hypotheses and divide one by the other, and you get:
P(hypothesis A)/P(hypothesis B) * P(observation|hypothesis A)/P(observation|hypothesis B)
This can be summarized as "posterior odds = prior odds * likelihood ratio (of the observation being made from each hypothesis)", where:
P(hypothesis A|observation)/P(hypothesis B|observation) = posterior odds,
P(hypothesis A)/P(hypothesis B) = prior odds,
P(observation|hypothesis A)/P(observation|hypothesis B) = likelihood ratio.
Let's go through an example: say you're investigating a murder. You think that Alice is twice as likely to be guilty compared to Bob - this is your prior odds. You then observe fingerprints on the murder weapon that are 3000 times more likely to have come from Alice than from Bob - this is the likelihood ratio. You multiply these ratios to calculate your new opinion, the posterior odds: Alice is now 6000 times more likely to be guilty than Bob. Posterior odds is prior odds times likelihood ratio.
This is still Bayes' theorem, just in a different algebraic form. The intuition captured by this equation is the same: an observations counts as evidence towards the hypothesis that better predicts, anticipates, explains, or agrees with that observation. But notice that in this form, P(observation) - which was difficult or impossible to calculate - has been cancelled out. Also, P(hypothesis) - another troublesome number - only appears in a ratio of two competing hypotheses, which I think is a more reasonable way to think of it: it's easier to say how much more likely one hypothesis is than another, instead of assigning absolute probabilities to both of them. In short, this form makes the math easier, and allows you to think of just two hypotheses at a time, rather than having to account for the complete set of competing hypotheses all at once. You don't have to worry about Carol and her fingerprints for the time being in the above murder investigation example.
Let's go through a couple more examples:
Say that your friend claims that he has a trick coin: he says it lands "heads" all the time, rather than the 50% of the time that you'd normally expect. You're somewhat skeptical, and based on his general trustworthiness and the previous similar claims he's made, you only think that there's a 1:4 odds that this is a 100% "heads" coin, versus it being a normal coin. This is your P(always heads)/P(normal), the prior odds.
When you express your skepticism, your friend says, "well then, let me just show you!" and flips the coin. It lands "heads". "See!" says your friend. "I told you it'll always lands heads!" Now, obviously a single flip doesn't prove anything. But it certainly is evidence - not very strong evidence, but some evidence. Since the coin will land "heads" 100% of the time if your friend is right, but only 50% of the time if it's a normal coin, their ratio - the likelihood ratio - is 100%:50%, or 2:1.
Now, according to the odds form of Bayes' theorem, posterior odds is prior odds times likelihood ratio. 1:4 * 2:1 = 1:2, so you should now believe that there's a 1:2 odds that this is a trick coin like your friend claimed, versus it being a normal coin. You're still skeptical of the claim, but you're now less skeptical.
Noting your remaining skepticism, your friend then flips the coin again. "Ha, another heads!" he says as he calls out the result. Now, to calculate your new opinion, simply repeat the calculation above, with the previous answer - the old posterior odds of 1:2 - serving as the new prior odds. The likelihood ratio remains 2:1. Posterior odds is prior odds times likelihood ratio, so our new posterior odds is 1:2*2:1 = 1:1. You should now be completely uncertain as to whether this coin in fact is a trick coin. You say to your friend, "well, you may have something there".
"Okay, fine then." says your friend. "Let's flip this thing ten more times." And behold, it comes up "heads" all ten times. Your posterior odds get multiplied by 2:1 for each of the ten flips, and it's now 1:1 * (2:1)^10 = 1024:1. You should now believe that the chance of this being an "always heads" coin is 1024 times greater than it being a normal coin. If you're willing to consider "normal" and "always heads" as the complete set of competing hypotheses, this would give you over 99.9% certainty that your friend is right that this coin will always land heads.
"Wow, amazing." you tell your friend, as you're now pretty much convinced. "I've never actually seen one of these before", you say, as you idly grab the coin and flip it again, fully expecting it to land "heads" once more. But this time, it lands "tails".
What now? The likelihood ratio for the coin to land "tails" - P(tails|always heads)/P(tails|normal) - is 0%:50%, or 0:1. Our new posterior odds is 1024:1 * 0:1 = 0:1. There is now absolutely no chance that this coin is one that will land heads 100% of the time. But at the same time, it also seems unlikely that it's just a normal coin. given that it landed "heads" 12 times in a row just before this. A new possibility suggests itself: that this coin has something like a 90% chance of landing heads.
This illustrates one of the major advantages of the odds form of Bayes' theorem. Before this, you hadn't even considered that the chance for this coin to land "heads" was anything other than 50% or 100%. All of the other hypotheses - such as the coin landing "heads" 90% or 80% or 20% of the time - you had ignored. And yet, even without considering the complete set of competing hypotheses, you were still able to carry out valid calculations and make statistical inferences, reaching sound conclusions.
You both stare at the coin that landed "tails". You ask your friend, "What just happened?" He replies, "well, the magician I bought it from said that it would always land heads. And it seemed to be working fine up 'til now. Maybe he just meant that it'll land heads most of the time?" Being naturally suspicious, you respond, "Looks like he lied to you then. He probably just sold you a normal coin". But your friend comes back with, "C'mon, you know that's not fair. Human language doesn't work like that. It's imprecise by its very nature. When someone says 'always' in casual conversation, they don't necessarily mean '100.000000...% of the time' with an infinite number of significant figures. Even 'normal' coins don't land heads exactly 50.000000...% of the time". Struck by your friend's rare moment of lucid articulation, you become temporarily speechless. "Besides", your friend continues, "the magician might have said that the coin 'nearly always lands heads'. I don't remember exactly".
With this new insight, you realize that your had set your priors to the wrong hypotheses at the beginning of the problem. Instead of the hypotheses that the coin to land "heads" exactly 100% of the time, or exactly 50% of the time, you should have set them to 'close to 100% of the time' and 'close to 50% of the time'. Giving the odds of P(close to 100%)/P(close to 50%) = 1:4 as before, and interpreting "close to" as a flat distribution within 2% of the given value, We get that the likelihood ratio for the coin landing "heads" is P(heads|close to 100)/P(heads|normal) = 99%:50% = 1.98:1, and for the coin landing "tails" is P(tails|close to 100)/P(tails|normal) = 1%:50% = 0.02:1. Then the value for the posterior odds after 12 heads and 1 tails is given by prior odds times likelihood ratio, and it is roughly:
1:4 * (1.98:1)^12 * 0.02:1 = 18.15:1
(This is an approximation, made by assuming that the probability distribution can be thought of as being entirely focused at the center of their interval. The actual value, 16.97:1, can be obtained by a straightforward integration over the probability distributions, but that calculation lies beyond the scope of this introductory post.)
So you don't have to abandon the "close to 100%" hypothesis along with the "exactly 100% hypothesis. The odds are still 18:1 in favor of the coin landing "heads" more than 98% of the time, against it being a "normal" coin - enough for you to be reasonably confident in believing as your friend does.
This illustrates again the advantages of using the odds form. Firstly, we again didn't have to consider other probability values for the coin landing "heads", such as 75%. We were still able to come to a reasonable conclusion without having to specify the complete set of competing hypotheses, and their probability distribution. Secondly, we were able to completely switch the class of hypotheses under consideration, without losing consistency. If we had stuck to the original form of Bayes' theorem, then we would have had to specify our prior probabilities for P(heads exactly 100% of the time) and P(heads exactly 50% of the time). To maintain our 1:4 ratio, we would assign them as 20% and 80%, taking up all 100% of our probability, because we were not thinking about other possibilities. But then, upon realizing our mistake, we would have no choice but to contradict our previous priors, and assign P(heads close to 100% of the time) and P(close to 50% of the time) some values, while going back and admitting that the chances of the coin giving exactly 50% or 100% "heads" are nearly zero. This is a problem created entirely by being unaware of the complete set of competing hypotheses.
But with the odds form, we don't have to have complete awareness. All the conclusions that we came to are still perfectly consistent with the data: there is zero chance for the coin to land "heads" exactly 100% of the time, yet it is much more likely that the "heads" probability is close to 100% than it being a normal coin. Our two sets of priors do not contradict each other either: it's quite reasonable for our prior odds to be 1:4 in both cases, because we have not specified how much of the total probability they take up. In general, I feel that it's easier to say how likely two hypotheses are relative to one another, rather than specifying the absolute probability value for a hypothesis.
I hope this convinces you of the virtues of the odds form of Bayes' theorem. This is how I use Bayes' theorem in everyday situations to sharpen my thinking: I didn't know if this one movie was going to be any good (prior odds), but upon its recommendation from a friend (likelihood ratio), I revise my opinion and are now more likely to see it (posterior odds). I didn't know whether Argentina or Germany is more likely to win the World Cup (prior odds), but upon watching Germany slaughter Brazil (likelihood ratio), I now consider Germany more likely than Argentina to win the World Cup (posterior odds). So on and so forth. Posterior odds is prior odds times likelihood ratio.
Let's consider a couple of last examples:
I don't know if Bill Gates owns Fort Knox (prior odds). But I know that he's rich, and he's more likely to be rich if the owns Fort Knox than if he does not (likelihood ratio). Therefore, given that Bill Gates is rich, he's more likely to own Fort Knox (posterior odds).
Does that reasoning sound suspicious? It should. I took it straight from the Wikipedia page on "affirming the consequent", which is a logical fallacy. But the structure of the above argument is correct according to Bayes' theorem. It follows the same structure as all of my other examples. So, has Bayesian reasoning lead to a logical fallacy? Oh no! What shall we do?
Hold that thought, while we consider our last example:
I don't know whether Einstein's theory of general relativity, or Newton's theory of gravity is correct. (prior odds). But upon considering the experimental evidence of bending of starlight observed during the 1919 solar eclipse (likelihood ratio), I now consider general relativity much more likely to be correct than Newtonian gravity (posterior odds).
You should recognize that as the event that actually "proved" general relativity to the public, and the epitome of the scientific method at work: hypotheses are judged according to their agreement with experimental observations. But this is nothing more than just straightforward Bayesian reasoning, following the same structure as all of my other examples. So, it turns out that Bayesian reasoning underlies the scientific method, by providing the logical framework for it.
What are we to make of these two last examples? Does Bayesian reasoning allow for affirming the consequent? But isn't that a logical fallacy? But doesn't Bayesian reasoning also underlie the scientific method? Does that mean that science follows a logically flawed system? What are we to make of this?
I will address these issues in my next post.