# How to Remember the Key Bayes Formula in Statistics

Bayes Formula is a simple formula that gives a rule for updating the probability that a hypothesis is true given new evidence — information or data. For example, opinion polls show an increase in public belief in “global warming” during unusually hot years and a decrease in public belief in “global warming” during unusually cold years. This winter (2014) has been unusually cold and skepticism of “global warming” has grown. This is actually the change in degree of belief that Bayes Formula would predict.

Bayes Formula is usually presented in a form that is remarkably difficult to remember:

$$P(A|B)=\frac{P(A)P(B|A)}{P(B)}$$

The difficulty in remembering Bayes Formula is largely due to the traditional use of A and B as placeholders for the hypothesis and the new evidence (or data). These are the first two letters in the Roman alphabet and provide no cues or hints as to the meaning of Bayes Formula.

Using H for hypothesis in place of A and E for evidence in place of B makes it much easier for English speakers to learn and remember Bayes Formula:

$$P(H|E)=\frac{P(H)P(E|H)}{P(E)}$$

In English, Bayes Formula can now be read as an abbreviation for “the probability of the hypothesis given the evidence $$P(H|E)$$, known as the posterior probability, is equal to the probability of the hypothesis before the new evidence is considered $$P(H)$$, known as the prior probability, multiplied by the probability of the evidence given the hypothesis $$P(E|H)$$, known as the likelihood, divided by a normalizing factor, the probability of the evidence alone $$P(E)$$.”
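The sentence above translates almost word for word into code. This is a minimal sketch, not a library function: the name `posterior` and the three example numbers are my own assumptions, chosen only to illustrate the formula.

```python
def posterior(prior, likelihood, evidence):
    """Return P(H|E) = P(H) * P(E|H) / P(E)."""
    if evidence <= 0.0:
        raise ValueError("P(E) must be positive to condition on E")
    return prior * likelihood / evidence


# Hypothetical numbers, invented for illustration only.
p_h = 0.2          # prior probability P(H)
p_e_given_h = 0.9  # likelihood P(E|H)
p_e = 0.3          # probability of the evidence P(E)

print(posterior(p_h, p_e_given_h, p_e))  # roughly 0.6
```

Naming the arguments `prior`, `likelihood`, and `evidence` serves the same purpose as using H and E: the names carry the meaning.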

Bayes Formula was first derived in the 18th century by Thomas Bayes and is closely associated with Bayes Theorem, which is the actual derivation of the formula and the proof of its validity. Bayes Formula is often associated with Bayesian probability and statistics, but both the formula and the theorem are equally valid in other kinds of probability and statistics, such as “frequentist” probability and statistics.

Bayes Formula is widely used in machine learning, a branch of Artificial Intelligence (AI). In particular, it is used in the state-of-the-art Hidden Markov Model (HMM) speech recognition algorithms. Hidden Markov Model is a misleading label for the algorithms, which incorporate many other methods in addition to a Hidden Markov Model. The open source Carnegie Mellon University (CMU) Sphinx speech recognition engine contains about sixty thousand (60,000) lines of highly mathematical code in the C programming language developed by many researchers over many years.

In speech recognition, the system often has some idea what the speaker may say next based on the context, that is, what they have said so far. In the case of true homonyms, words or phrases that sound exactly the same, such as “to”, “too”, and “two”, speech recognition depends entirely on the context. For some words and phrases such as “ice cream” and “I scream” or “media rights” and “meteorites” that may be slightly different when spoken precisely but often sound the same in normal speech, speech recognition must rely primarily or entirely on context.

In many cases, when words or phrases are very similar, such as “pit” and “bit”, speech recognition systems use Bayes Formula to combine two quantities. The first is the prior probability: the probability of the word, from historical data on how frequently words such as “pit” or “bit” follow the preceding words in spoken English. The second is the likelihood of each possible word: the probability of the acoustic properties of the word itself given that the speaker intended to say, for example, “bit” or “pit”.
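A toy version of the “pit” versus “bit” decision might look like the sketch below. All four numbers are hypothetical: they were not measured from any corpus and are not taken from CMU Sphinx or any other real recognizer. The point is only to show how a prior from context and a likelihood from the acoustics combine.

```python
# Hypothetical numbers, invented for illustration only.
priors = {"pit": 0.7, "bit": 0.3}        # P(word | preceding words)
likelihoods = {"pit": 0.2, "bit": 0.6}   # P(acoustics | word)

# Prior times likelihood for each candidate word (the numerator of Bayes Formula).
scores = {word: priors[word] * likelihoods[word] for word in priors}

# The probability of the evidence is the sum of the numerators over all candidates.
evidence = sum(scores.values())

# Posterior probability of each word given the acoustics and the context.
posteriors = {word: score / evidence for word, score in scores.items()}

best = max(posteriors, key=posteriors.get)
print(posteriors, best)
```

With these made-up numbers the context favors “pit” but the acoustics favor “bit” strongly enough that “bit” ends up with the larger posterior probability.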

In short, Bayes Formula is much easier to remember as:

$$P(H|E)=\frac{P(H)P(E|H)}{P(E)}$$

where H is for hypothesis and E is for evidence, instead of:

$$P(A|B)=\frac{P(A)P(B|A)}{P(B)}$$

where the meaning of A and B is a mystery and it is remarkably easy to misremember the order of A and B.

The rest of this article goes into more detail about the meaning of Bayes Formula, which also makes it easier to remember and properly use this deceptively simple formula.

What is probability?

Probability is a remarkably difficult concept to define in a rigorous quantitative way. Human beings have made statements such as “this is probable” and “that is not likely” since ancient times. Attempts to express the notion of probability in rigorous and quantitative terms appear to date from the 1600s, when the earliest foundations of the mathematical theory of probability and statistics were laid.

The frequentist theory of probability and statistics interprets or defines a probability as the frequency or rate at which an outcome occurs given a large, potentially infinite number of repetitions of an experiment or measurement. In frequentist statistics, one would say that the statement that a coin has a probability of one half (0.5) of coming up heads when tossed means that given a large number of coin tosses, the rate at which heads occur tends toward 0.5 as this large number of tosses tends to infinity. The frequentist theory really only works for situations such as a coin toss that can be repeated many times. This is often not true in everyday life, health and medicine, economics, finance, marketing, and many other fields.
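The frequentist picture of a coin toss is easy to simulate. The sketch below assumes a fair coin modeled with Python's pseudo-random number generator; the function name `heads_frequency`, the seed, and the sample sizes are arbitrary choices of mine.

```python
import random


def heads_frequency(n_tosses, rng):
    """Toss a simulated fair coin n_tosses times and return the fraction of heads."""
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses


rng = random.Random(2014)  # fixed seed so that runs are repeatable
for n in (10, 1_000, 100_000):
    print(n, heads_frequency(n, rng))
```

The observed rate of heads wanders for small numbers of tosses and settles toward 0.5 as the number of tosses grows, which is exactly what the frequentist definition describes.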

What about the probability that a statement such as “David Cameron is the Prime Minister of the United Kingdom” is true? This is a one time measurement or experiment. It can’t be repeated in any way. Nonetheless, I would assign a high probability, say 99 percent or higher, to this statement because I recall seeing David Cameron mentioned in many headlines and articles as if he was the Prime Minister of the UK (I don’t follow British politics so I actually wasn’t certain Cameron was the Prime Minister until I wrote this article and checked carefully). In everyday conversation, words like “probability” or “likelihood” and phrases such as “I am ninety-five percent certain” are often used in this way, expressing a degree of belief, rather than a frequency of occurrence. Bayesian statistics, which has enjoyed a strong revival in the last twenty years, defines or interprets probability as “a degree of belief”. Bayesian statistics still uses a number from 0.0 to 1.0 but purports to be able to make rigorous quantitative judgments about the probability of statements such as “David Cameron is the Prime Minister of the United Kingdom” about which frequentist statistics makes no claims.

Bayes Formula and Bayes Theorem are valid under both frequentist and Bayesian statistics. In cases where there is a fully repeatable experiment or measurement, such as tossing a coin, and there is plenty of historical data available and this historical data is used for the so-called prior probability $$P(H)$$, Bayes Formula will yield identical results in both theories of probability and statistics. But with hypotheses such as “David Cameron is the Prime Minister of the United Kingdom” or “global warming is true,” which do not involve fully repeatable experiments such as tossing a coin, Bayesian statistics can make quantitative predictions using Bayes Formula where frequentist statistics essentially throws up its hands and walks away. This often involves making an educated guess based on personal experience and intuition for the prior probability $$P(H)$$ in Bayes Formula. In my example, I used a probability of 99 percent based on my personal experience of reading articles. Once I checked Google, Wikipedia, and a range of other sources, I updated this probability to essentially 1.0. Notice this use of a numerical estimate for the probability as a degree of belief in Bayesian statistics and in everyday conversation is decidedly subjective and generally difficult to justify.

In modern quantitative theories of probability, probability is usually quantified as a number from 0.0 to 1.0. In frequentist statistics, 0.0 means “never happens” (a rate of occurrence of zero) and 1.0 means “always happens.” In Bayesian statistics, 0.0 means something like “flatly untrue” and 1.0 means something like “absolutely certain.”

What is a conditional probability?

A conditional probability is a key concept for understanding Bayes Formula. A conditional probability is usually written $$P(H|E)$$: P for probability, then, in parentheses, H, a vertical bar meaning “given”, and E. It is the probability that something, e.g. H for a hypothesis, is true given that something else, e.g. E for evidence, is true.

A concrete example:

Let H, our hypothesis, be “Mr. X is an American citizen (citizen of the United States of America)”. The United States had a population of 313.9 million on February 2, 2014. The total population of the world on February 2, 2013 was 7.21 billion. Therefore, without any additional information available, the probability that Mr. X is a US citizen is:

$$P(H) = 313.9/7210.0 = 0.043537$$

However, what if we get some additional information, E? Mr. X is a member of the US Congress. What is the probability that Mr. X is a US citizen given that he is a member of the US Congress? Of course, US law requires that members of the US Congress be US citizens. Therefore, the conditional probability that Mr. X is a US citizen is 1.0.

$$P(H|E) = 1.0$$

A conditional probability can be very different from a regular probability.
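One way to see the difference is to count. The sketch below uses a tiny made-up population of 100 people, not real census figures, and computes a conditional probability by first restricting attention to the people for whom the evidence is true. The helper names `probability` and `conditional` are my own.

```python
from fractions import Fraction

# Toy population, invented for illustration: each person is a pair
# (is_citizen, is_member_of_congress).
population = (
    [(True, True)] * 2        # citizens who are members of Congress
    + [(True, False)] * 18    # citizens who are not members
    + [(False, False)] * 80   # non-citizens (none of them are members)
)


def probability(event, people):
    """Fraction of the given people for whom the event is true."""
    return Fraction(sum(1 for person in people if event(person)), len(people))


def conditional(event, given, people):
    """P(event | given): restrict to people satisfying `given`, then count."""
    restricted = [person for person in people if given(person)]
    return probability(event, restricted)


def is_citizen(person):
    return person[0]


def is_member(person):
    return person[1]


p_h = probability(is_citizen, population)                      # 20/100 = 1/5
p_h_given_e = conditional(is_citizen, is_member, population)   # 2/2 = 1
print(p_h, p_h_given_e)
```

The sketch does not guard against conditioning on evidence that never occurs; in that case the restricted list is empty and the division fails.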

The probability that both H and E are true can be expressed in terms of conditional probabilities:

$$P(H \cap E) = P(H) P(E|H)$$

$$P(H \cap E) = P(E) P(H|E)$$

The symbol $$\cap$$ in the expression $$H \cap E$$ means the intersection of the sets $$H$$ and $$E$$. This is a fancy way of saying both the hypothesis $$H$$ and the evidence $$E$$ are true.

Notice, one can combine these two equations to get:

$$P(E) P(H|E) = P(H) P(E|H)$$

If one divides through by $$P(E)$$, the new equation becomes:

$$P(H|E)=\frac{P(H)P(E|H)}{P(E)}$$

This is Bayes Formula! I have just derived Bayes Formula.
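The derivation can be checked numerically. The sketch below starts from a hypothetical joint distribution over H and E (the four numbers are invented) and uses exact fractions so that the equalities hold exactly rather than approximately. Because the conditional probabilities are computed from the joint table, the two product identities hold by construction; the point is to watch Bayes Formula fall out of them.

```python
from fractions import Fraction

# Hypothetical joint distribution P(H, E); the four entries sum to 1.
joint = {
    (True, True): Fraction(3, 20),    # H true,  E true
    (True, False): Fraction(5, 20),   # H true,  E false
    (False, True): Fraction(2, 20),   # H false, E true
    (False, False): Fraction(10, 20), # H false, E false
}

p_h = sum(p for (h, e), p in joint.items() if h)   # P(H) = 2/5
p_e = sum(p for (h, e), p in joint.items() if e)   # P(E) = 1/4
p_h_and_e = joint[(True, True)]                    # P(H and E) = 3/20

p_e_given_h = p_h_and_e / p_h   # P(E|H) = 3/8
p_h_given_e = p_h_and_e / p_e   # P(H|E) = 3/5

# The two ways of writing the probability that both H and E are true.
assert p_h * p_e_given_h == p_h_and_e
assert p_e * p_h_given_e == p_h_and_e

# Dividing through by P(E) gives Bayes Formula.
assert p_h_given_e == p_h * p_e_given_h / p_e
print(p_h_given_e)
```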

What is likelihood?

The probability of the evidence given the hypothesis $$P(E|H)$$ in Bayes Formula is known as the likelihood. In everyday English, “probability” and “likelihood” are used interchangeably, often to express a degree of belief as in Bayesian statistics rather than a frequency of occurrence, although both meanings are used in conversational English. In the mathematical theory of probability and statistics, likelihood has a special technical meaning distinct from common English usage. Probability and likelihood are not fully interchangeable terms, synonyms, in the mathematical theory. Likelihood refers to the probability of evidence or data given a particular hypothesis. It is never used, for example, to refer to the probability of a hypothesis $$P(H)$$ or the probability of the hypothesis given the evidence $$P(H|E)$$.

The famous statistician Ronald Fisher built much of his system of probability and statistics around this technical concept of likelihood. Much of his work is based on the concept of “maximum likelihood” or “maximum likelihood estimation,” which generally refers to finding the hypothesis H that maximizes the likelihood $$P(E|H)$$.
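As a small illustration of maximum likelihood, suppose each hypothesis is “the coin comes up heads with probability p” and the evidence is 7 heads in 10 tosses. The coin example and the brute-force grid search are my own assumptions for this sketch; real maximum likelihood estimation usually relies on calculus or a numerical optimizer rather than trying every candidate.

```python
from math import comb


def likelihood(p, heads, tosses):
    """P(E|H): probability of `heads` heads in `tosses` tosses if P(heads) = p."""
    return comb(tosses, heads) * p**heads * (1 - p) ** (tosses - heads)


# Candidate hypotheses: p = 0.00, 0.01, ..., 1.00.
candidates = [i / 100 for i in range(101)]

# The maximum likelihood hypothesis is the candidate with the largest P(E|H).
best = max(candidates, key=lambda p: likelihood(p, 7, 10))
print(best, likelihood(best, 7, 10))
```

Notice that no prior probability $$P(H)$$ appears anywhere in this calculation: maximum likelihood ranks the hypotheses by $$P(E|H)$$ alone.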

Likelihood can give misleading and counter-intuitive results. It is easy to confuse the likelihood $$P(E|H)$$ with the probability of the hypothesis given the evidence, the so-called posterior probability in the language of Bayes Formula. Most often, it is the probability of the hypothesis given the evidence that we want to know: is David Cameron really the Prime Minister of the United Kingdom? Is global warming really true?

In the Member of Congress example, what is the probability that Mr. X is a Member of Congress (the evidence E) given that he is a US citizen? There are 319 million US citizens and only 535 voting members of Congress (counting both Representatives and Senators). The likelihood is:

$$P(E|H) = 535/319{,}000{,}000 = 0.000001677$$

Notice the remarkable fact that although the likelihood $$P(E|H)$$ is tiny, the posterior probability $$P(H|E)$$ that Mr. X is a US Citizen given the evidence that he is a member of Congress is 1.0, much larger.

What is the probability of the evidence (the normalizing factor)?

The normalizing factor $$P(E)$$, the probability of the evidence, is very important in some cases such as the Member of Congress example. In the Member of Congress example, there are two competing hypotheses: “Mr. X is a US citizen” and “Mr. X is not a US citizen”. Let’s call these hypothesis zero $$H_0$$ and hypothesis one $$H_1$$. The evidence E is that Mr. X is a Member of Congress.

Using Bayes Formula, the probabilities of $$H_0$$ and $$H_1$$ are:

$$P(H_0|E)=\frac{P(H_0)P(E|H_0)}{P(E)}$$

and

$$P(H_1|E)=\frac{P(H_1)P(E|H_1)}{P(E)}$$

Since there are only two possible hypotheses (either Mr. X is a US citizen or he is not), the probabilities should sum to one (1.0):

$$P(H_0|E) + P(H_1|E) = 1.0$$

In this case, the probability of the evidence, the normalizing factor, is:

$$P(E) = P(H_0)P(E|H_0) + P(H_1)P(E|H_1)$$

Mathematicians frequently use the terms “normalize” or “normalizing” to refer to the process of scaling the terms in a sum or the function in an integral so that the sum is one (unity) or the integral is one (unity).

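Notice, in this case, the probability that Mr. X is a Member of Congress given that he is not a US citizen $$P(E|H_1)$$ is zero (0.0). The probability of the evidence, the normalizing factor $$P(E)$$, is simply:

$$P(E) = P(H_0)P(E|H_0) = 0.043537 \times 0.000001677 \approx 7.30 \times 10^{-8}$$

Computed directly, the probability of the evidence, that Mr. X is a Member of Congress, is:

$$P(E) = 535/7{,}210{,}000{,}000 \approx 7.42 \times 10^{-8}$$

The two values differ slightly only because the prior above used a US population of 313.9 million while the likelihood used 319 million. With a single population figure in both factors, the US population cancels and the product equals $$535/7{,}210{,}000{,}000$$ exactly.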

The probability of the evidence, the normalizing factor, is just the tiny probability that Mr. X, of all the seven billion plus people on Earth, is a Member of the US Congress which has only 535 members. This is very unlikely, but given that someone is a Member of Congress, they will be a US citizen, even though it is also unlikely that someone selected at random will be a US citizen.
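The whole Member of Congress calculation fits in a few lines. This sketch makes two simplifying assumptions that are mine rather than facts: it uses one US population figure (313.9 million) for both the prior and the likelihood, and it treats every US resident as a citizen. Exact fractions avoid any rounding.

```python
from fractions import Fraction

world = 7_210_000_000   # world population figure used in the article
us = 313_900_000        # US population figure used for the prior
members = 535           # voting members of Congress

# Assumption: the same US figure is used in the prior and the likelihood.
prior_h0 = Fraction(us, world)    # P(H0): Mr. X is a US citizen
prior_h1 = 1 - prior_h0           # P(H1): Mr. X is not a US citizen
like_h0 = Fraction(members, us)   # P(E|H0): a citizen is a Member of Congress
like_h1 = Fraction(0)             # P(E|H1): non-citizens cannot be members

# Normalizing factor: sum over both hypotheses of prior times likelihood.
p_e = prior_h0 * like_h0 + prior_h1 * like_h1

post_h0 = prior_h0 * like_h0 / p_e   # P(H0|E)
post_h1 = prior_h1 * like_h1 / p_e   # P(H1|E)

print(float(p_e), post_h0, post_h1)
```

The tiny prior times the tiny likelihood is divided by an equally tiny normalizing factor, which is how the posterior probability comes out to exactly one.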

In general, given a complete, exhaustive set of hypotheses $$\{H_i\}$$, the probability of the evidence, the normalizing factor, is the sum over all hypotheses of the probability of the hypothesis times the probability of the evidence given that hypothesis:

$$P(E) = \sum_i P(H_i)P(E|H_i)$$

where $$\Sigma$$ is the Greek letter sigma used to represent a sum and i is the index over all possible hypotheses.
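The general sum is straightforward to code for any finite list of hypotheses. In this sketch the function name `posteriors` and the three-coin example are hypothetical: three coins with heads probabilities 0.25, 0.5, and 0.75 are equally likely to have been picked, and the evidence is a single toss that came up heads.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes Formula to an exhaustive list of hypotheses.

    priors[i] is P(H_i) and likelihoods[i] is P(E|H_i).
    Returns the list of posterior probabilities P(H_i|E).
    """
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(unnormalized)  # P(E) = sum over i of P(H_i) P(E|H_i)
    if evidence == 0:
        raise ValueError("P(E) is zero under every hypothesis")
    return [u / evidence for u in unnormalized]


# Hypothetical example: three coins, equal priors, one observed head.
priors = [1 / 3, 1 / 3, 1 / 3]
likelihoods = [0.25, 0.5, 0.75]   # P(heads | coin i)

result = posteriors(priors, likelihoods)
print(result)  # approximately [1/6, 1/3, 1/2]
```

Dividing by the sum is the “normalizing” step: it guarantees that the posterior probabilities add up to one.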

In a continuous case, the sum may be replaced by an integral over a continuous parameter or set of continuous parameters defining each hypothesis. This is a more advanced case that goes beyond the level of this article.

A Glaring Weakness

Bayes Formula has a glaring weakness. What happens if I assign a prior probability of zero $$P(H) = 0.0$$ to a hypothesis? Let’s say, for example, that I firmly believe it is utterly impossible for space aliens to visit the Earth. Can’t happen. Never has happened. Never will happen. Einstein, the lightspeed barrier in special relativity, and our knowledge of physics clearly show that no one could build a space ship capable of traveling from even the nearest star Alpha Centauri to Earth.

Tomorrow little gray space aliens with big black almond-shaped eyes land in their flying saucers on the White House lawn, the main quad at Harvard, Bill Gates’s mansion, and hundreds of other locations all over the Earth. It is all over CNN, CNBC, FOX, Slashdot and Hacker News. They even demonstrate a technology that allows them to float through walls, quite obviously beyond any classified military technology that could exist. They say they are from Zeta Reticuli and they are here to help. At this point, most of us would agree space aliens exist and can visit the Earth, Einstein and special relativity notwithstanding.

What does my Bayes Formula calculation tell me?

$$P(H|E)=\frac{P(H)P(E|H)}{P(E)}$$

Given my prior probability of zero, the probability that space aliens can visit the Earth given a mass landing at hundreds of locations all over the planet is … zero. Yes. ZERO. Yes, that is ZERO with a Z. Zero times anything is still zero. Science and rigorous mathematics have spoken. Evidence? We don’t need no stinking evidence!

This is the “Cromwell’s Rule” problem, named after Oliver Cromwell’s famous plea:

“I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”

The solution (modern day epicycle?) is never to assign a prior probability of zero. Even for extremely unlikely hypotheses, use a small, non-zero prior probability. The obvious problem with this solution is what tiny, but non-zero probability to use: one in a million, one in a billion, one in a googol, one in a googolplex, or even smaller? There are some advanced methods, such as Good-Turing smoothing, for estimating what this tiny number should be.
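The effect of a zero prior, and of the small non-zero fix, can be seen in a short sketch. The two likelihoods are hypothetical numbers of mine: I assume the mass landing is nearly certain to be observed if aliens can visit, and fantastically improbable (an elaborate hoax, say) if they cannot.

```python
def update(prior, like_h, like_not_h):
    """Return P(H|E) for a hypothesis H and its negation.

    like_h is P(E|H) and like_not_h is P(E|not H).
    """
    evidence = prior * like_h + (1 - prior) * like_not_h
    return prior * like_h / evidence


# Hypothetical likelihoods, invented for illustration only.
like_h = 0.99       # P(mass landing observed | aliens can visit)
like_not_h = 1e-12  # P(mass landing observed | aliens cannot visit)

for prior in (0.0, 1e-9, 1e-6):
    print(prior, update(prior, like_h, like_not_h))
```

With a prior of exactly zero the posterior stays at zero no matter how strong the evidence is, while even a one-in-a-billion prior is pushed very close to one. The answer still depends on which tiny prior is chosen, which is the weakness of this fix.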

Conclusion

Remember Bayes Formula as:

$$P(H|E)=\frac{P(H)P(E|H)}{P(E)}$$

where H is for hypothesis and E is for evidence, instead of:

$$P(A|B)=\frac{P(A)P(B|A)}{P(B)}$$

where the meaning of A and B is a mystery and it is remarkably easy to misremember the order of A and B. Some readers may find that H for hypothesis and D for data works better for them. Ultimately, use what works best for you; A and B rarely work well.

Acknowledgement

This article owes a lot to Allen Downey’s excellent on-line articles and presentations on Bayes Formula and Bayesian statistics. Naturally, any errors in this article are the author’s responsibility.

© 2014 John F. McGowan