The previous article Are Fingerprints Unique? discussed the case of Brandon Mayfield, a Muslim American attorney from the Portland, Oregon area who was wrongly identified as one of the Madrid train bombers in 2004 by the FBI based on an erroneous fingerprint identification.
The Mayfield case is probably the most famous case of an incorrect fingerprint identification. The Mayfield case is an example of a “cold hit” in which a huge biometric database was searched for a possible match to an unknown fingerprint taken from a crime scene. Unlike suspects with plausible links to the crime, there was nothing specific to connect Mayfield to the crime other than the database search match.
There are subtle and serious mathematical and statistical problems with cold hits, which occur with both DNA profiling and fingerprint identification. This article explores in detail the mathematics and statistics of the cold hit problem.
The cold hit problem is closely related to a well-known problem in probability and statistics known as the birthday problem. Imagine a room full of people: Bob, Frank, Mary, Estelle, and others. Each person has a birthday: May 1, December 13, March 11, July 17, and so on.
Not knowing the birthdays of the people in the room, what is the probability that at least two people in the room have the same birthday? How many people need to be in the room for there to be an even (50/50) chance that at least two people in the room have the same birthday?
A naive and incorrect answer would be to reason as follows. There are three-hundred and sixty-five (365) days in the year. The probability that two people have the same birthday is 1/365. Therefore, the probability that at least one pair of people in a room with N people have the same birthday is about N/365. Thus the room needs about 183 people for an even chance of a match. The actual answer is twenty-three (23) people, much smaller than 183!
Let us consider the problem in detail. First, what is the probability that Bob and Frank have the same birthday? There is a 1/365 chance that Bob was born on January 1. There is a 1/365 chance that Frank was born on January 1. Thus, there is a 1/(365*365) chance that both Bob and Frank were born on January 1. There are, however, three hundred and sixty-five days in the year, so the probability that Bob and Frank were born on the same day is 365/(365*365) or 1/365.
We need to find the probability that at least one pair of people in the room (Bob and Frank, Bob and Mary, Bob and Estelle, Frank and Mary, Frank and Estelle, Mary and Estelle, and all other possible distinct pairs) have the same birthday. If there are N people in the room, there will be [tex]N(N-1)/2[/tex] distinct possible pairs of people. Each pair will have a probability of 1/365 of having the same birthday.
The probability that at least one pair of people have the same birthday is:
P = 1.0 - (Probability that the Pair Does Not Have the Same Birthday)^(Number of Distinct Pairs of People)
which is
P = 1.0 - (Number of Distinct Pairs of People)(Probility that the Pair Does Not Have the Same Birthday) or P = 1.0 - (1.0 - 1/365)^(N(N-1)/2)
It turns out that P is 0.50048, almost exactly even, for N = 23. The number of distinct pairs of people in the room is proportional to the square of the number of people in the room [tex](N(N-1)/2) [/tex], not the number of people in the room (N). Hence, it takes far fewer people in the room than one would naively expect for there to be an even chance that at least two people in the room have the same birthday.
The plot of the probability of at least two people in a room having the same birthday was generated using the two Octave scripts below: birthday.m and plot_bday.m.
Octave is a free open-source numerical programming environment that is mostly compatible with MATLAB.
birthday.m
function [p] = birthday(n, m, bTrace) % p = birthday(n [, m, bTrace]) % probability that at least one pair of members of set of N have same birthday (M days in year) % n number of people % m number of "days" in year (default value = 365) % bTrace flag to trace operation of function (default value = false) % % (C) 2011 John F. McGowan % E-Mail: jmcgowan11@earthlink.net % if nargin < 2 m = 365; bTrace = false; end if nargin < 3 bTrace = false; end p = 0.0; p_no_pair = 1.0; % probability no pair of people in the sample have the same birthday % loop over pairs of people in the sample (room full of people) % brute force % for i = 1:n % for j = i+1:n % p_pair = m*(1/m)*(1/m); % probability i and j have same birthday % p_no_pair = p_no_pair*(1.0 - p_pair); % end % end % fast number_pairs = n * (n-1)/2; p_pair = m*(1/m)*(1/m); p_no = 1.0 - p_pair; if bTrace printf("number_pairs: %d p_pair: %f p_no: %f\n", number_pairs, p_pair, p_no); fflush(stdout); end % if p_no_pair = p_no_pair*power( p_no, number_pairs); p = 1.0 - p_no_pair; end % function
plot_bday.m
% plot probability of at least two people having the same birthday % in a room full of N people % % (C) 2011 John F. McGowan, Ph.D. % E-Mail: jmcgowan11@earthlink.net % p = zeros(1,100); for i=1:100 if mod(i, 10) == 0 printf("processing %d people in the room\n", i); end p(i) = birthday(i); end printf("displaying graph"); fflush(stdout); figure(1); plot(p); title('Probability At Least Two People Have Same Birthday'); ylabel('P'); xlabel('Number of People in Room'); printf("writing plot to file prob_bday.jpg"); fflush(stdout); print('prob_bday.jpg');
What does the birthday problem have to do with fingerprint identification, DNA profiling, or other forms of biometric identification? Replace the people in the room with fingerprints or other biometric identifiers (DNA profiles, iris images, faces,...) in a database.
Replace the three-hundred and sixty-five distinct birthdays with thousands, millions or more distinct biometric identification codes derived from the fingerprint, DNA profile, iris, or other form of identification. The pairs of people with the same birthday become pairs of people with the same fingerprint or other biometric identifier: the actual criminal who commits a crime and at least one other innocent person.
What happens if a fingerprint database has 100 million people and the chance of two people having the same fingerprint (we are referring to the same partial prints such as a thumb print lifted from a crime scene) is only one in a trillion ([tex]10^{12}[/tex]).
Astonishingly, the probability of at least two people in the database having the same fingerprint is almost one (1.0). This is because there are (100,000,000)(99,999,999)/2 possible pairs of people in the database — about five quadrillion (1,000 trillion) possible pairs. Even though the probability of any two people having the same fingerprint is extremely low, at least one misidentification occurring somewhere in the system is almost certain (probability 1.0).
The FBI fingerprint database contains about 200 million people, accumulated since the 1920s, and the probability of two people having identical or indistinguishable partial fingerprints (or even all ten fingerprints) is unknown.
DNA profiles are currently claimed to have a probability of two people having the same profile of about one in ten trillion. With cold hits, with a search of a large database of DNA profiles such as are currently being collected, it is actually likely that there will be incorrect matches somewhere in the system.
Brandon Mayfield probably fell victim, in part, to the counter-intuitive statistics of the birthday problem. As the size of biometric databases collected by governments, law enforcement agencies, intelligence agencies, and private companies grows, the cold hit problem will grow — as the square of the number of entries in the databases.
If everyone, all of the nearly seven billion people on Earth, was in the databases, one could produce a list of all possible suspects based on fingerprint or other biometric identification alone. This could easily be hundreds or thousands or even more people.
How does one handle possible suspects who lack an adequate alibi and could have flown to a crime? How many of those possible suspects will have some tenuous seven degrees of separation connection to the crime? Brandon Mayfield was a Muslim American who had represented an alleged Islamic terrorist in a child custody case: a tenuous but possible connection to the terrorists responsible for the Madrid train bombings. This is the crux of the cold hit problem.
© 2011 John F. McGowan
About the Author
John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing video compression and speech recognition technologies. He has extensive experience developing software in C, C++, Visual Basic, Mathematica, MATLAB, and many other programming languages. He is probably best known for his AVI Overview, an Internet FAQ (Frequently Asked Questions) on the Microsoft AVI (Audio Video Interleave) file format. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at jmcgowan11@earthlink.net.
Great analogy to the cold hits. Great read.
I’m a high school maths teacher, so, of course, I love this sort of stuff. It also gives me a great opportunity to use the birthday problem in class and to introduce its more practical twin: as a cautionary tale and also a use of mathematics in a practical setting. Thanks so much.