A practical understanding of probability and statistics, at an advanced (at least college) level, is increasingly important in the modern world. For example, many expensive and potentially hazardous drugs, including chemotherapy for cancer and anti-cholesterol drugs such as Lipitor, are approved for use and justified to patients based on complex statistical studies. Children are increasingly medicated for a range of alleged psychiatric disorders such as Attention Deficit Hyperactivity Disorder (ADHD or ADD), bipolar disorder, and others. Many questions have arisen about the seeming epidemic of autism (see the recent article The Mathematics of Autism).
Important public policy issues such as “global warming” hinge on complex mathematical models and statistics. The public is often swayed by shocking, widely repeated statistics such as “the Soviet Union is producing two to three times as many engineers and scientists as the United States” (1950’s), “one million missing children” (1980’s), and “drugs cost $800 million” to research and develop.
Complex mathematical and statistical models for mortgage-backed securities played a major role in the housing bubble and the financial crash of 2008. The financial system continues to rely on these so-called derivative securities despite numerous costly failures.
On the positive side, free open-source tools with powerful statistical capabilities, such as GNU Octave and the R programming language, are widely available. More and more data is available in accessible formats such as comma-separated values (CSV) files, tab-delimited files, and Excel spreadsheets. LibreOffice is a free open-source program that can read most Excel spreadsheet formats. More information on probability and statistics is available at Wikipedia and other online sources. The National Institutes of Health (NIH), to its credit, is, for now, attempting to make research data and papers funded by the NIH openly available. Many other research programs seem to be trying to do the same. Hopefully these trends will continue.
Formal college-level education in probability and statistics tends to focus on idealized situations: flipping a fair coin, games of chance at a fair casino, and highly idealized laboratory experiments in “hard sciences” such as physics that lack many of the difficulties encountered in frontier research or in real-world data from “softer” fields such as economics, finance, medicine, biology, psychiatry, marketing, and so on. In many real-world situations, the major problems encountered, including issues such as how the data is collected and how the numbers are defined, differ from typical textbook accounts of probability and statistics.
This article discusses the pitfalls and gotchas of probability and statistics in practice.
Averages, Medians, and Distributions
Averages can be highly misleading. For example, these two sequences of ten numbers have the same average value — ten (10):
octave-3.2.4.exe:10> a = [10 10 10 10 10 10 10 10 10 10];
octave-3.2.4.exe:11> mean(a)
ans = 10
octave-3.2.4.exe:12> median(a)
ans = 10
octave-3.2.4.exe:13> b = [1 1 1 1 1 1 1 1 1 91];
octave-3.2.4.exe:14> mean(b)
ans = 10
octave-3.2.4.exe:15> median(b)
ans = 1
The average, or arithmetic mean, is the sum of all the numbers in the sequence divided by the number of values. The median is the middle value when the sequence is sorted in increasing order (or the average of the two middle values when the sequence has an even number of elements), so that equal numbers of elements lie above and below the median value.
The median is an example of a robust statistic that is less susceptible to misleading outliers in the data. It is often better to look at the median instead of the average, especially with noisy real-world data.
The median can also be misleading. These two sequences have the same median — ten (10) — but are quite different.
octave-3.2.4.exe:10> a = [10 10 10 10 10 10 10 10 10 10];
octave-3.2.4.exe:11> mean(a)
ans = 10
octave-3.2.4.exe:12> median(a)
ans = 10
octave-3.2.4.exe:21> b = [0 0 0 0 10 10 100 100 100 100];
octave-3.2.4.exe:22> median(b)
ans = 10
In the first case, the median value of ten is highly representative of the typical value in the sequence. In the second case, the spread of values is very high and the median misrepresents the typical values, which are zero and one hundred.
Any single statistic such as the average, median, or mode (most common value in the data) can be misleading depending on the underlying distribution of the sequence and the context in which the statistic is used.
No matter how convincing a statistic may seem, it is best to examine the distribution of the underlying data.
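In GNU Octave, for example, one can quickly inspect the distributions behind identical medians with a histogram. This is a minimal sketch using the two sequences above:

% Compare the distributions behind two sequences with the same median
a = [10 10 10 10 10 10 10 10 10 10];
b = [0 0 0 0 10 10 100 100 100 100];
subplot(2, 1, 1); hist(a); title('a: every value is 10');
subplot(2, 1, 2); hist(b); title('b: values pile up at 0 and 100');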
Outliers and the Bell Curve
The Gaussian, also known as the Normal Distribution or Bell Curve, is very heavily used, often improperly, in statistics.
[tex]P(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\left( x - \mu \right)^2 / 2\sigma^2}[/tex]
The Gaussian is taught in almost all introductory probability and statistics courses, at least at the college level. There is a theorem, known as the Central Limit Theorem, that the distribution of the average of a sequence of independent identically distributed (IID) random variables with finite variance converges to the Gaussian distribution as the number of variables in the sequence (N) tends to infinity.
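The Central Limit Theorem can be seen in action with a few lines of Octave. This sketch averages uniform random variables, which are decidedly non-Gaussian individually, and the averages pile up into a bell curve (the sample sizes are arbitrary):

% Central Limit Theorem sketch: average many uniform random variables
N = 100;                       % variables per average
M = 10000;                     % number of averages computed
averages = mean(rand(N, M));   % mean of each column; rand is uniform on [0, 1]
hist(averages, 50)             % approximately Gaussian, centered near 0.5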
Consider some data generated according to the Gaussian/Normal Distribution/Bell Curve with a mean [tex]\mu[/tex] of 0.0 and a standard deviation [tex]\sigma[/tex] of 1.0.
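Such data can be generated with the built-in randn function. This is a minimal sketch; the sample size of 10,000 is arbitrary:

% Generate 10,000 samples from the standard normal distribution
data = randn(1, 10000);
mean(data)       % close to 0.0
std(data)        % close to 1.0
hist(data, 50)   % the familiar bell shape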
The Gaussian/Normal/Bell Curve is very heavily used in mathematical models today. However, despite the Central Limit Theorem, many real-world distributions are not Gaussian and have long tails. The data often contains outliers.
Several mathematical models used in quantitative finance, such as the famous Black-Scholes Option Pricing Model, use the Gaussian distribution. They often assume the returns for a financial asset are distributed according to a Gaussian distribution. Historical data shows that the returns for many financial assets do not have a Gaussian/Normal/Bell Curve distribution and often contain extreme “fat tail” outliers such as market crashes. Mathematical models using a Gaussian distribution tend to underestimate the risks of financial assets.
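The fat-tail problem can be illustrated by comparing the Gaussian to a fat-tailed distribution such as Student’s t with 3 degrees of freedom, constructed here from normal variables. This is a sketch, not a model of any actual asset, and the 5-sigma threshold is illustrative:

% Fat tails sketch: Gaussian vs. Student's t with 3 degrees of freedom
N = 1000000;
z = randn(1, N);                                     % Gaussian samples
t3 = randn(1, N) ./ sqrt(sum(randn(3, N).^2) / 3);   % t(3) built from normals
t3 = t3 / sqrt(3);                                   % rescale to unit variance
erfc(5 / sqrt(2))    % theoretical Gaussian P(|X| > 5), about 6e-7
mean(abs(z) > 5)     % empirical Gaussian tail: essentially zero
mean(abs(t3) > 5)    % fat tail: orders of magnitude more extreme events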
Statistical Significance
Statistical significance can be a treacherous concept. Statistical significance is often reported as something known as a p value. The p value usually refers to the probability that the data, the set of measurements, could have arisen by pure chance. The lower the p value, the greater the statistical significance of a result.
Consider flipping a coin. The probability that five heads will appear by chance in a row is:
[tex] (\frac{1}{2})(\frac{1}{2})(\frac{1}{2})(\frac{1}{2})(\frac{1}{2}) = (\frac{1}{32})[/tex]
or 3.125 percent (0.03125).
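This is easy to confirm by simulation in Octave (a quick sketch):

% Simulate 100,000 runs of five coin flips and count the all-heads runs
flips = rand(5, 100000) < 0.5;   % 1 = heads, 0 = tails
mean(all(flips))                 % fraction of all-heads runs, about 0.031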
This is less than five percent. Many scientific journals accept papers that report a p value of five percent or less for their results. The p value is often interpreted as meaning there is a [tex]1 - p[/tex] probability that the hypothesis being tested is correct, but that is not really correct.
Keep in mind that people flip coins and get five heads (or five tails) in a row all the time. With a p value of only five percent, one in twenty published papers reporting a p value of five percent will be wrong purely by chance.
Would you live in a house that had a five percent chance of collapsing on you? Drive over a bridge that had a five percent chance of collapsing as you cross the bridge? Probably not. Even though ninety-five percent seems high and is typically an A in classroom homework, it is not a very high level of confidence in the real world.
The p value also tells you nothing about whether the “statistically significant” effect was due to the hypothesis being tested or the cause suggested by the authors of a scientific paper or study. Quite a number of studies in parapsychology (ESP, etc.) have produced impressive levels of statistical significance. Is this due to the hypothesized paranormal cause, sophisticated cheating, or some other unknown cause? “Something else” is very difficult to rule out.
Statistical significance is not the same as the strength of an effect. For example, drug A might have an effect of 1.0 on some scale whereas drug B has an effect of 1.0000001, a negligible improvement in practice, but the statistical significance of this result could be extremely high. The p value could be one in a trillion. One may be very confident of a tiny, unimportant difference.
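A sketch of this point in Octave, with purely illustrative numbers: given enough measurements, a negligible effect produces an astronomically small p value.

% Tiny effect, huge statistical significance (illustrative numbers)
N = 100000000;          % 10^8 measurements
effect = 0.001;         % negligible difference on the scale in question
sigma = 1.0;            % standard deviation of a single measurement
se = sigma / sqrt(N);   % standard error of the mean, 10^-4
z = effect / se         % z-statistic of 10
p = erfc(z / sqrt(2))   % two-sided p value, roughly 1.5e-23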
In some fields such as experimental particle physics, there is skepticism about the interpretation of the p value or equivalent measures of statistical significance. This is because many results that have been reported with very low p values nonetheless could not be replicated. In some cases, such as the pentaquark, several different research groups reported the same or a similar effect which ultimately “went away.”
Systematic Errors
Probability and statistics say little about systematic errors. The OPERA experiment’s spurious report of faster-than-light neutrinos was due to a systematic error in measuring time delays, very tiny time delays. The result was statistically significant but incorrect for other reasons.
Correlation and Causation
Correlation does not prove causation. There are many statistical methods and single statistics (numbers) that measure whether two or more measurements are correlated. Even if A and B are perfectly correlated, this can mean A causes B, B causes A, A and B share a common cause, or even certain kinds of chance occurrences.
Common Correlation Coefficients in GNU Octave
octave-3.2.4.exe:8> data = randn(1, 100);
octave-3.2.4.exe:9> data2 = 2.0*data;
octave-3.2.4.exe:10> corrcoef(data, data2)
ans = 1.0000
octave-3.2.4.exe:11> data3 = randn(1,100);
octave-3.2.4.exe:12> corrcoef(data, data3)
ans = -0.080590
octave-3.2.4.exe:13> kendall(data, data2)
ans = 1
octave-3.2.4.exe:14> spearman(data, data2)
ans = 1
octave-3.2.4.exe:15> kendall(data, data3)
ans = -0.028283
octave-3.2.4.exe:16> spearman(data,data3)
ans = -0.049889
In the GNU Octave code above, randn generates random data with the normal distribution with mean 0.0 and standard deviation 1.0. data and data2 are perfectly correlated since data2 is exactly two times data. data and data3 are uncorrelated. The function corrcoef computes Pearson’s correlation coefficient, the most commonly used correlation coefficient. Frequently, this is what is used to say two data sets are correlated. The functions kendall and spearman implement other, less commonly used correlation coefficients.
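The chance occurrences mentioned above can be striking. Two completely independent random walks, for example, frequently show large spurious correlations (a sketch; rerun it a few times to see large positive and negative values appear):

% Two independent random walks often appear strongly "correlated"
walk1 = cumsum(randn(1, 100));
walk2 = cumsum(randn(1, 100));
corrcoef(walk1, walk2)   % frequently far from zero, purely by chance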
Even though most scientists, mathematicians, and statisticians are taught that correlation does not prove causation, it is common to find this disregarded in practice, especially in biology and medicine. Many prominent theories in biology and medicine are based, on close examination, on a correlation, perhaps a very strong correlation, but only a correlation.
Beware of the use of language such as “the link between A and B” or “the relationship between A and B” used as if “link” or “relationship” means A causes B (or B causes A). Link and relationship are very general terms. If A and B are correlated, one can honestly say there is a “link” or “relationship” between A and B, even though causation is not actually proven by a correlation.
Categories and Definitions
By far the greatest and most common problem with using probability and statistics in the real world lies in the definition of terms, categories, and measured values. When counting the number of engineers produced by the United States, the Soviet Union in the 1950’s, China, or other nations, what is an engineer? What is a missing child in “one million missing children?” What does it mean to say someone has been cured of cancer or has survived cancer? What is autism?
An engineer can be: someone with a B.S. in an engineering discipline, someone licensed to practice as an “engineer” by a government body, a Ph.D. in an engineering discipline, an A.A. in an engineering discipline, a technician with a high school diploma or GED, an enthusiast with an 8th grade education like Orville and Wilbur Wright, a civil engineer, an electrical engineer, a “software engineer,” a computer programmer, a medical technician, a nurse, an agricultural technician and so on.
In the 1950’s and 1960’s, Soviet expert Nicholas DeWitt used a broad definition of scientists and engineers to argue that the Soviet Union produced two to three times as many scientists/engineers as the United States, by amongst other things including engineers receiving correspondence degrees, medical workers including nurses, and agricultural workers in his total (see MIT Historian David Kaiser’s article The Physics of Spin: Sputnik Politics and American Physicists in the 1950s).
A missing child can be a teenager who runs away from home after an argument for a few hours. A missing child can be a child who leaves voluntarily, but illegally, with a non-custodial parent. A missing child can be a child abducted by a non-custodial parent. A missing child can be a long term runaway or “throwaway.” A missing child can be a child abducted and killed by a psychopath. In the 1980’s, even to the present day occasionally, the statistic “one million missing children” (even larger numbers were sometimes cited) was used to imply the latter. Fortunately, most reported missing children cases involve short term runaways or parental custody cases, certainly cause for concern in some cases but not an epidemic of homicide or stranger abductions.
In the medical literature, being “cured” of cancer or “surviving” cancer often means living for at least five years after being diagnosed with the disease. This differs dramatically from the common English usage of the words “cured” and “survive.” Since cancer is often a slow-progressing disease (many people with untreated cancer will live at least five years), this practice is particularly misleading.
The statistics on the prevalence of autism from the United States Centers for Disease Control (CDC) are extremely difficult to interpret due to the vague and broad definition of “autism spectrum disorders,” a situation the CDC has done little to resolve despite many years and billions of dollars in funding for autism research.
These definitional issues are rarely discussed in introductory college-level textbooks on probability and statistics, and then usually only briefly. These textbooks deal with very clean, well-defined situations such as flipping an idealized perfectly fair coin. Heads is well defined and unambiguous. Tails is equally well defined and unambiguous. There is no question that the coin has an equal chance of coming up heads or tails. There is no cheating.
In public policy debates, scientific controversies, and other real-world applications of probability and statistics, issues about how the data were collected, how the terms and values are measured and defined, and what the categories used actually mean often take center stage and are the subject of both bitter controversy and simple confusion. It often requires extensive research to resolve these issues; often they are not resolved, certainly not to the satisfaction of all.
Conclusion
A good understanding of probability and statistics is increasingly necessary in the modern world. There are many ways to misuse probability and statistics, both intentionally and by accident. One should almost never take a statistic at face value, especially when powerful vested interests are at stake. The best course of action is to examine the data and the analysis of the data carefully. Unfortunately, this is often time consuming, but for important issues there is no substitute.
© 2012 John F. McGowan
About the Author
John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing video compression and speech recognition technologies. He has extensive experience developing software in C, C++, Visual Basic, Mathematica, MATLAB, and many other programming languages. He is probably best known for his AVI Overview, an Internet FAQ (Frequently Asked Questions) on the Microsoft AVI (Audio Video Interleave) file format. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at jmcgowan11@earthlink.net.
Suggested Reading/References
How to Lie with Statistics
By Darrell Huff
Using Murder: The Social Construction of Serial Homicide
By Philip Jenkins
This book about a depressing topic is somewhat pedantic but has some good discussions of the use and misuse of crime statistics for serial killers and murders in the 1980s.
The $800 Million Pill: The Truth behind the Cost of New Drugs
By Merrill Goozner
A critical look at the claims, made by pharmaceutical companies, that drugs cost an average of $800 million to research and develop.
Toil, Trouble, and the Cold War Bubble: Physics and the Academy since World War II
David Kaiser’s Presentation at the Perimeter Institute on the Cold War Physics Bubble
Includes a detailed discussion of how Nicholas DeWitt’s scientist and engineer production numbers were used and abused during the Cold War.
When Genius Failed: The Rise and Fall of Long-Term Capital Management
By Roger Lowenstein
A dry run for the current financial crisis with a good, non-technical discussion of the fat tails problem in quantitative finance.
I’m glad more people are pointing out the flaws in p values, but I think your explanation is in error. Specifically, you say:
“Keep in mind that people flip coins and get five heads (or five tails) in a row all the time. With a p value of only five percent, one in twenty published papers reporting a p value of five percent will be wrong purely by chance.”
That is only true if all the studies have a statistical power of 1 (which never happens) and exactly 50% of tested hypotheses are true, which is also unlikely. A common scenario in experimental science is a statistical power of 0.5 and only 10% of tested hypotheses being correct, in which case 45% of papers reporting a p value of 0.05 will be wrong purely by chance.
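In Octave, the arithmetic behind that figure looks roughly like this (a sketch; the numbers are the hypothetical scenario above):

% 1000 hypotheses, 10% true, power 0.5, alpha 0.05
n = 1000; base_rate = 0.10; pwr = 0.5; alpha = 0.05;
true_pos = n * base_rate * pwr;            % 50 real effects detected
false_pos = n * (1 - base_rate) * alpha;   % 45 false alarms
false_pos / (true_pos + false_pos)         % roughly 0.47 of significant
                                           % results are wrong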
I’ve written some more examples here:
https://www.refsmmat.com/statistics/#the-p-value-and-the-base-rate-fallacy
Statistical power is too often neglected. Many studies which claim to have found no significant difference between two groups (e.g. control group and medicated group) actually do not have the power to detect differences between groups with statistical certainty, because their sample sizes are too small.
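As a hypothetical illustration in Octave (the sample size and effect size here are made up):

% Power sketch: small samples often cannot detect a real difference
n = 20;                          % subjects per group (hypothetical)
delta = 0.5;                     % true difference, in standard deviations
se = sqrt(2 / n);                % standard error of the difference in means
zcrit = 1.96;                    % two-sided critical value at alpha = 0.05
Phi = @(x) 0.5 * erfc(-x / sqrt(2));   % standard normal CDF via erfc
1 - Phi(zcrit - delta / se)      % power of about 0.35: a real effect is
                                 % missed roughly two times out of three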
This is a nice article, but there is one problem I found. The discussion of Black-Scholes states that because Black-Scholes assumes returns are normally distributed, and actual returns are not, Black-Scholes underestimates the risk associated with financial assets.
This is not strictly true, because Black-Scholes does not use the actual or “real” returns distribution. Black-Scholes is a risk-neutral valuation technique, meaning that it prices assets in a hypothetical world where investors do not care about risk (which is clearly easier to do). This means that Black-Scholes does not assume that the returns we observe in the real world will be normally distributed, only that returns in a risk-neutral world are.
In the case of most risk-neutral pricing methods, a solution (once you have one) can be shown to be the same under most or even all attitudes towards risk; it’s easier to derive solutions in a risk-neutral setting and then show that the same solution applies in our own.
The short version is that the normality assumption in Black-Scholes is not necessarily a problem in and of itself.
Firstly, in the cancer literature “cure” or explicit “survival” is never, or rarely, used (only in the case of people who genuinely appear to undergo full regression); the common terminology is progression-free survival, which is measured in years.
Secondly, statistics in the biomedical field is never used to prove a hypothesis. Statistical analysis can only ever hope to disprove the hypothesis that whatever you are looking at has no effect, or the opposite effect (the null hypothesis).
Throughout my studies at two different institutions in the UK I was taught a solid foundation in statistics. While I will never be (and cannot hope to be) an expert in the field, and don’t want to be, I was given enough of a foundation to interpret what a statistic in a paper actually means. The literature isn’t written for general consumption, and any conclusions given are there for interpretation by the writer’s peers.
The problem you seem to be describing is people without enough expertise in the domain trying to interpret the scientific literature, or (as is common in academia) researchers without the ability to communicate complex findings to lay people.
Papers that communicate results with a clear misunderstanding of statistics, or that represent their data misleadingly, should be easily discredited by anyone with a degree-level understanding of the respective domain. Indeed, that process is the basis of current scientific understanding in any domain. Misrepresentation (purposeful or otherwise) by third parties is an inevitability.
The Author Responds
See this recent paper, for example:
Yavchitz A, Boutron I, Bafeta A, Marroun I, Charles P, et al. (2012) Misrepresentation of Randomized Controlled Trials in Press Releases and News Coverage: A Cohort Study. PLoS Med 9(9): e1001308. doi:10.1371/journal.pmed.1001308
https://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.1001308
and also the coverage of the recent ENCODE “junk DNA” results.
What is more common to encounter than explicitly improper use of statistics in the peer-reviewed literature is that peer-reviewed articles or books qualify their language, terms, and definitions, more or less correctly, but sometimes in ways that give the terms and phrases, stripped of the qualifications, a meaning that differs greatly from common English usage, and is therefore likely to be highly misleading.
This is particularly the case with cancer survival rates, for example. “Cancer survival rate” has a very different meaning in common English usage from “Cancer survival rate after only five years.”
As the PLOS paper above illustrates, abstracts, press releases, informal statements by authors, official reports by funding agencies, and so on often contribute, intentionally or not, to a misleading interpretation of the statistics or findings in the press and/or the general public. This may well be innocent in some cases, but it is sadly common in biology and medicine, and researchers, certainly at the principal investigator level, should know how the English language is used in common usage.
CERN’s announcement of the maybe-Higgs/maybe-not-Higgs “observation” is a recent high-profile example of this sort of weasel wording in experimental particle physics.
https://press.web.cern.ch/press/PressReleases/Releases2012/PR17.12E.html
Sincerely,
John
Opinionated Lessons in Statistics: #14 Bayesian Criticism of P-Values (VIDEO – about 20 minutes)
14th segment in the Opinionated Lessons in Statistics series of webcasts, based on a course given at the University of Texas at Austin by Professor William H. Press.
https://www.youtube.com/watch?v=IKV6Pn18C7o
Sincerely,
John
What does eighty percent (80%) mean? A recent example of murky statistics in science:
https://arstechnica.com/staff/2012/09/most-of-what-you-read-was-wrong-how-press-releases-rewrote-scientific-history/
https://blogs.nature.com/news/2012/09/fighting-about-encode-and-junk.html
As is often the case in real-world applications of statistics, the issue revolves around the definitions of terms and categories.
John
See Common Errors in Statistics at https://www.amazon.com/Common-Errors-Statistics-Avoid-Them/dp/1118294394
An interesting blog post on statistics:
Why aren’t we doing the maths?
The practical implications of misplaced confidence when dealing with statistical evidence are obvious and worrying
https://timharford.com/2012/10/why-arent-we-doing-the-maths/