Big Data Flubs Donald Versus Hillary

By John F. McGowan, Ph.D. Posted November 9, 2016 in Applied Math, General, Probability and Statistics.

Donald Trump’s unexpected upset victory over Hillary Clinton raises troubling questions about the use of mathematical models and “Big Data” in politics. With the possible exception of the USC Dornsife/LA Times Daybreak Election Poll, nearly all pre-election polls and surveys appear to have significantly underestimated the level of popular support for Donald Trump in the United States. They underestimated it, and usually by about the same amount: a full six percentage points, which is not a small discrepancy.

On October 12, The New York Times published an article, “How One 19-Year-Old Man in Illinois Is Distorting National Polling Averages,” debunking the USC/LA Times Poll. In retrospect, the “distortion” does not appear to have been a distortion at all.

At best, this widespread failure of the pre-election polls suggests a systematic bias; at worst, it suggests deliberate fraud intended to manipulate the outcome of the election. Indeed, Donald Trump was widely ridiculed for suggesting that the pre-election polls (as distinct from the voting, which he later suggested might be rigged as well) were being rigged against him, a claim that seems more plausible now in light of the election results.

As the discussions of the USC/LA Times poll and its competitors reveal, the 2016 pre-election polls are not simple surveys of potential voters where the raw results are reported to the public. Rather, they are adjusted in complex ways, supposedly to compensate for sampling errors and other biases in the raw results. Those adjustments apparently did not work so well in the 2016 Presidential Election.
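
To make the idea of "adjusting" raw results concrete, here is a minimal sketch of one common textbook adjustment, post-stratification weighting. It is not a description of what any particular 2016 pollster did. The group names, respondent counts, and population shares are invented for illustration, and the function names `raw_share` and `poststratified_share` are hypothetical.

```python
def raw_share(sample):
    """Unweighted share of respondents who back the candidate.

    `sample` maps a group name to (respondents, supporters).
    """
    total = sum(n for n, _ in sample.values())
    supporters = sum(s for _, s in sample.values())
    return supporters / total


def poststratified_share(sample, population_shares):
    """Reweight each group's support rate by its assumed population share."""
    return sum(
        population_shares[group] * (supporters / respondents)
        for group, (respondents, supporters) in sample.items()
    )


# Invented numbers: the sample over-represents college graduates
# relative to the assumed electorate.
sample = {"college": (600, 240), "non_college": (400, 220)}
population_shares = {"college": 0.4, "non_college": 0.6}

print(f"raw share:      {raw_share(sample):.3f}")                                # 0.460
print(f"weighted share: {poststratified_share(sample, population_shares):.3f}")  # 0.490
```

In this toy case the adjustment moves the estimate by three points, and the result depends entirely on the assumed population shares. When a group has very few respondents, each of their answers is multiplied by a large weight, which is the kind of sensitivity the debate over the USC/LA Times poll turned on.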

There is a long history of pre-election polls getting the election results wrong by sometimes striking amounts. Who can forget the famous image of Harry Truman holding a copy of the arch-conservative Republican Chicago Tribune announcing Thomas Dewey as the winner of the 1948 Presidential Election?

Harry Truman, wrongly reported as defeated by Thomas Dewey in 1948

For decades, questions have been raised about whether some polls were manipulated. Usually the suggestion is that the polls are manipulated to make a candidate appear stronger than he or she actually is, in hopes of swaying the election in favor of that candidate. Whether this stratagem actually works is debatable, as Donald Trump’s victory illustrates. Perhaps honesty is the best policy after all.

I did not vote for Donald Trump and find his erratic behavior and murky business connections alarming. The point of this article is not to endorse Trump or conservative claims of liberal media bias. The point is to emphasize the dramatic failure — once again — of supposedly sophisticated statistics, mathematical modeling, and Big Data in the current election.

At best these sophisticated mathematical methods failed dramatically to predict the outcome of the election. At worst, they were used as an intimidating mathematical smokescreen for unsuccessful propaganda apparently on behalf of Hillary Clinton.

We are increasingly inundated with mathematical models in modern politics. These include the models used to predict global warming. They include the controversial Value Added Models (VAM) used to evaluate, hire and fire teachers. Many other examples can be cited. Extremely powerful computers, high bandwidth networks, the proliferation of data from sensors and other devices, and a Big Data/Machine Learning craze are combining to shift public debate from open understandable arguments in English to arcane disputes about impenetrable statistics and mathematical models.

It is often extremely difficult, perhaps impossible, to evaluate these models. Global warming, for example, is a tiny effect much smaller than normal daily, seasonal, and yearly variations in temperatures. Teacher performance is difficult to evaluate due to substantial variations in students and teaching conditions beyond the control of even the best teachers.

In the case of the 2016 Presidential Election, however, we can see an example of these modern mathematical models clearly failing in real time.

About the Author

John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing gesture recognition for touch devices, video compression and speech recognition technologies. He has extensive experience developing software in C, C++, MATLAB, Python, Visual Basic and many other programming languages. He has been a Visiting Scholar at HP Labs developing computer vision algorithms and software for mobile devices. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at jmcgowan11@earthlink.net.

5 Comments

Gerald Belton November 9, 2016

But the Trump campaign says they owe at least part of their success to Big Data:
https://adage.com/article/campaign-trail/trump-camp-capitalized-early-voting-data/306690/

- rmf November 10, 2016
  
  Perhaps because they used the early voting data correctly, to adjust the weighting in their models according to the real data, rather than their wishful thinking.
  
Bill November 10, 2016

Anyone who has studied surveying (polling) understands that several aspects affect the results:
– how questions are written/asked
– the sample size and how well it represents the masses (voters in this case)
– the models used

But all 3 of these can be subconsciously slanted based on the person/group responsible for each. So you can get tainted data, tainted samples and a tainted model, causing abnormal predictions vs. actual outcomes. I think this election was masterfully managed early by Trump and his team. They quickly and often painted the media as being biased and lying, drilling the message over and over into the people. The media made it easy since it was easy to find examples that proved Trump’s point. The people became very wary of any polling (we never answered any polls, whether phone or email based). Brexit was also similar and probably also contributed to Trump’s “stunning” victory. It has come out that Trump’s team did use big data from a consulting firm in England that understood how to ask the right questions and analyze the raw data into valuable information. As with Obama after his elections in 2008 and 2012, US citizens need to see how he and his team perform once in office. The citizens will have another midterm election in 2018 and another presidential election in 2020. They can decide to continue with the current administration or change it out.

John F. McGowan, Ph.D. November 14, 2016

Nate Silver of 538, commenting after the election on why the 538 model gave Trump relatively high odds of winning compared to other polls/models, although 538 also predicted Hillary Clinton as the most probable winner:

https://fivethirtyeight.com/features/why-fivethirtyeight-gave-trump-a-better-chance-than-almost-anyone-else/

He does not mention the USC/LA Times poll, although “almost” presumably is an oblique concession that the USC/LA Times poll proved more accurate than 538 in this case.

Aaron Montgomery April 7, 2017

I mean this in as kind a way as possible: most of this article is factually incorrect.
(1) National polls are an attempt to measure the popular vote, not the electoral college. Using 538’s pollster-weighted aggregation model, Clinton was projected to win the popular vote by 4.6% (https://projects.fivethirtyeight.com/2016-election-forecast/). She won by 2.1%. This represents a 2.5% polling miss, not a 6% miss as claimed in this article. This is a relatively common magnitude of error in a presidential election (https://fivethirtyeight.com/features/trump-is-just-a-normal-polling-error-behind-clinton/).
(2) The Dornsife poll was not more correct than other polls, since it (like all others) was trying to capture the popular vote share. The final Dornsife poll had Trump up 3%; he lost nationally by 2%, for an error of 5%.
(3) Fivethirtyeight’s final prediction was not inappropriate and cannot be honestly cited as a reason that it got the landscape wrong. If I roll a die and remark that there’s a 2/3 chance it won’t land on 5 or 6, and then it does anyway, that doesn’t mean my analysis was incorrect.
(4) Some other poll aggregators besides 538 were likely too overconfident of a Clinton win. That’s not a failing of polling; it’s a failing of modeling, and it was not one committed by all involved in the poll aggregation game. It’s pretty unfair to call this a general failing of modeling, and extraordinarily unfair to call it a failing of polling. The polling was fine — or, at least, as fine as it ever is.
(5) Pundits were overwhelmingly wrong, but this is a separate and much less interesting problem than what’s being claimed in this article. The failure was not in the polling, but rather in a lack of deep understanding by the media and some in the modeling / aggregation business to account for correlation between polling errors in different states and the impact on the electoral college.
(6) I find it troubling that so many incorrect claims are put in this article with no citations or support. Hunting for such things could have spared the author some significant error.
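
The arithmetic in points (1) and (2) and the correlation argument in point (5) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not a reconstruction of any aggregator's model: the function names `margin_miss` and `upset_probability` are hypothetical, and the five equally weighted "states", the 3-point lead, and the 3-point error spread are invented. Margins are expressed as Clinton minus Trump, in percentage points.

```python
import random


def margin_miss(projected_margin, actual_margin):
    """Absolute polling miss on the margin, in percentage points."""
    return abs(projected_margin - actual_margin)


def upset_probability(lead, total_sd, correlation,
                      n_states=5, trials=20000, seed=0):
    """Estimate how often a favorite who leads every state still loses a majority.

    Each state's polling error is a shared national component plus a local
    component, scaled so that every state has the same total standard
    deviation and the given pairwise correlation (assumed, not measured).
    """
    rng = random.Random(seed)
    shared_sd = total_sd * correlation ** 0.5
    local_sd = total_sd * (1.0 - correlation) ** 0.5
    upsets = 0
    for _ in range(trials):
        shared = rng.gauss(0.0, shared_sd)
        states_lost = sum(
            1 for _ in range(n_states)
            if lead + shared + rng.gauss(0.0, local_sd) < 0
        )
        if states_lost > n_states // 2:
            upsets += 1
    return upsets / trials


# Figures quoted in the comment above.
print(f"538 aggregate miss: {margin_miss(4.6, 2.1):.1f} points")
print(f"Dornsife miss:      {margin_miss(-3.0, 2.0):.1f} points")

# Same lead and same per-state error size; only the correlation differs.
print("independent errors:", upset_probability(3.0, 3.0, correlation=0.0))
print("correlated errors: ", upset_probability(3.0, 3.0, correlation=0.8))
```

Under these assumptions the favorite's chance of losing is several times larger when the state errors move together, even though no individual state poll is any less accurate. That is the sense in which a model can be overconfident while the underlying polls are about as good as usual.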
