When Big Data Bites Back

by Edward Tenner

Edward Tenner, a research affiliate of the Smithsonian Institution and Rutgers University and author of The Efficiency Paradox: What Big Data Can’t Do, is currently a Visitor in the Program in Interdisciplinary Studies at the Institute for Advanced Study at Princeton.

Published July 6, 2020


Could Donald Trump be right about deliberate bias in political polling — a charge he recently raised about a CNN poll showing him 14 percentage points behind Joe Biden? Steven Shapiro, writing for Politico, rejects Trump’s charge that the numbers were cooked. But he raises a question that goes beyond politics to the heart of a data-driven economy and society: Can assumptions lead even the best qualified experts to misleading conclusions when interpreting statistics?

A Twice-Told Tale

The public opinion industry had its Titanic moment in 1948, when the arch-Republican Chicago Tribune called the presidential election based on polls, leading to the iconic photograph of a beaming Harry Truman displaying the premature headline, “DEWEY DEFEATS TRUMAN.” (Like the Titanic shipwreck itself, the headline was the consequence of multiple unlikely circumstances — in this case, a newspaper typesetter strike that pressured editors to set headlines hours earlier than usual.)

A forthcoming book, Lost in a Gallup, by W. Joseph Campbell, shows how the 1948 polling catastrophe echoed the distortions in sampling that sank the almost-as-famous Literary Digest mail-in poll, which had been accurate in elections since 1920 but missed by a mile in the 1936 presidential contest between the incumbent Franklin Roosevelt and Alfred M. Landon. Campbell shows how other errors have bedeviled the polling industry since 1936 despite decades of advances in computer power and statistical sophistication.

It is difficult to read Campbell’s book without empathizing with the pollsters and (crucially, as Campbell observes) the journalists and pundits interpreting and amplifying their conclusions. Every means of sampling, whether by telephone or internet, introduces biases. Response rates unexpectedly decline. Predictions of who will actually vote become more complex — and are likely to be even more so in 2020, with the future of mail-in ballots and the safety of voting at polling stations uncertain. It may be a great accomplishment that polls do not fail more often.

2016 and All That

But the 2016 and possibly the 2020 elections are different. As Shapiro observes, the American Association for Public Opinion Research, in its postmortem of the 2016 failure, identified an Achilles’ heel of the polls. In hindsight, the blunder seems obvious: polls in decisive battleground states did not weight their samples for education levels, probably because until this campaign, education was not considered an important variable. White voters without college degrees were thus undercounted.
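What weighting a sample for education means can be shown with a small calculation. The Python sketch below is illustrative only: the electorate mix, the sample counts and the support levels are all invented, and the single-variable adjustment is a simplification, not a description of any pollster's actual method.

```python
# All figures below are hypothetical, chosen only to illustrate the mechanism.
population_share = {"college": 0.40, "non_college": 0.60}  # assumed electorate mix
sample_counts = {"college": 600, "non_college": 400}       # assumed poll respondents
support = {"college": 0.55, "non_college": 0.40}           # assumed share backing candidate A


def unweighted_estimate(counts, support_by_group):
    """Average support, treating every respondent equally."""
    total = sum(counts.values())
    return sum(counts[g] * support_by_group[g] for g in counts) / total


def education_weights(counts, pop_share):
    """Weight for each group = population share / sample share."""
    total = sum(counts.values())
    return {g: pop_share[g] / (counts[g] / total) for g in counts}


def weighted_estimate(counts, support_by_group, weights):
    """Average support after reweighting respondents by education group."""
    weighted_total = sum(counts[g] * weights[g] for g in counts)
    weighted_support = sum(
        counts[g] * weights[g] * support_by_group[g] for g in counts
    )
    return weighted_support / weighted_total


weights = education_weights(sample_counts, population_share)
raw = unweighted_estimate(sample_counts, support)
adjusted = weighted_estimate(sample_counts, support, weights)
print(f"unweighted: {raw:.1%}, weighted by education: {adjusted:.1%}")
```

With these made-up inputs, the unweighted figure is 49 percent and the education-weighted figure is 46 percent. A sample that overrepresents college graduates overstates candidate A by three points until the groups are rebalanced.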

It is unknown how many pollsters, or how many of the journalists who interpreted their results, had watched Donald Trump’s reality TV show, The Apprentice, which elevated Trump from vulgar New York eccentric to national brand. While ratings dropped steadily after the first season, the NBC program was still must-see TV for some 10 million Americans.

If the political gurus had watched The Apprentice faithfully, they might have noticed a favorite theme of the host: the rivalry of college-educated professionals against self-made high school grads. The third season’s teams represented book smarts (Team Magna) against street smarts (Team Net Worth). The winner, Kendra Todd, was a Magna member with an undergraduate degree in linguistics. But Trump still had made his point that the losers were higher earners as a group than the winners and that there was more than one path to success. The season was the only one nominated for an Emmy.

Donald Trump has alternated between celebrating his undergraduate degree from the University of Pennsylvania’s Wharton School and questioning the value of the business curriculum when weighed against his “genius.” What detractors consider vainglory has seemed to Trump’s partisans the power of positive thinking in the self-help tradition of Norman Vincent Peale in which Trump was raised — a creed scorned by academia but still followed by millions.

Thus, there were clues in plain sight of what might influence voters with Trump on the ballot. It was also common knowledge that the outcome in the Electoral College could turn on a handful of states, as it had in 2000 when a Supreme Court decision on Florida’s near dead heat decided the election.

The most serious error of all may have been to assume that the statistical errors in state polls were uncorrelated. In retrospect it seems obvious that the failure to factor in educational background meant that as goes Wisconsin, so goes Michigan and Pennsylvania.
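A small simulation suggests why the independence assumption matters. The Python sketch below is a toy model under stated assumptions: a hypothetical candidate leads by 2 points in each of three states, every state poll has a total error with a standard deviation of 3 points, and the only thing that varies is how much of that error is shared across states. The function name and all numbers are invented for illustration.

```python
import random


def prob_all_three_miss(lead, shared_sd, state_sd, trials=20000, seed=0):
    """Estimate how often a candidate leading in three state polls loses all three.

    shared_sd: spread of an error common to every state (e.g. a missing weight).
    state_sd:  spread of an error specific to each state.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        shared = rng.gauss(0, shared_sd)  # one draw applied to all three states
        if all(lead + shared + rng.gauss(0, state_sd) < 0 for _ in range(3)):
            misses += 1
    return misses / trials


# Both cases have the same total error per state (sd of 3 points):
# 0**2 + 3**2 == 2.4**2 + 1.8**2 == 9.
independent = prob_all_three_miss(lead=2.0, shared_sd=0.0, state_sd=3.0)
correlated = prob_all_three_miss(lead=2.0, shared_sd=2.4, state_sd=1.8)
print(f"independent errors: {independent:.1%}")
print(f"mostly shared errors: {correlated:.1%}")
```

In this toy setup, a simultaneous miss in all three states happens several times more often when most of the error is shared, even though each individual state poll is exactly as noisy in both cases.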

Devil’s in the Details

The lessons of the 2016 polling debacle go well beyond politics. They fall into three categories. The first is what statisticians call Simpson’s paradox. If we ignore certain variables — not necessarily clear in advance — we can draw false conclusions.

One of the most famous examples is a study of graduate school admissions at the University of California at Berkeley. Women seem to have been at a great disadvantage; 35 percent of female versus 44 percent of male applicants were selected. But when results were analyzed by department, they revealed a striking distortion. Women applied disproportionately to humanities and social science programs, which had relatively low acceptance rates for men as well as for women. In natural science programs, with acceptance rates over 50 percent, women actually were admitted at a higher rate than men, but the departments were much smaller than those in the humanities and social sciences. In the same way that department was the hidden variable at Berkeley, education appears to have been a neglected variable in key state polls.
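The reversal in the admissions example can be reproduced with a few lines of arithmetic. The Python sketch below uses invented applicant counts for two hypothetical departments; these are not the Berkeley figures, only numbers chosen to show how the paradox can arise.

```python
# Hypothetical applicant counts, invented for illustration; not the Berkeley data.
# Each entry is (applicants, admitted).
admissions = {
    "high_acceptance_dept": {"men": (800, 480), "women": (100, 70)},
    "low_acceptance_dept": {"men": (200, 40), "women": (900, 225)},
}


def rate(applicants, admitted):
    """Share of applicants who were admitted."""
    return admitted / applicants


def overall_rate(data, group):
    """Admission rate for one group with all departments pooled together."""
    applicants = sum(dept[group][0] for dept in data.values())
    admitted = sum(dept[group][1] for dept in data.values())
    return rate(applicants, admitted)


for name, dept in admissions.items():
    print(name, {g: f"{rate(*dept[g]):.1%}" for g in ("men", "women")})
print("overall", {g: f"{overall_rate(admissions, g):.1%}" for g in ("men", "women")})
```

In this made-up data, women are admitted at a higher rate than men in each department (70 versus 60 percent, and 25 versus 20 percent), yet at a much lower rate overall (29.5 versus 52 percent), because most women applied to the department that rejects most applicants.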

The second data distortion was the bandwagon effect. This was the fault not only of the pollsters but perhaps even more of the journalists and opinion writers who interpreted the polls. As one pundit after another forecast a Clinton victory, it became more daring to call the contest close, let alone predict that Trump would squeak by in the Electoral College.

London bookie websites, with no emotional ties to either campaign, were giving odds strongly favoring Hillary Clinton; they lost tens of millions of pounds. An international conference of astrologers fared no better. Even the positive-thinking Trump, as he later acknowledged to an audience in Wisconsin, seemed ready to throw in the towel.

The third challenge to polls is the difficulty of framing questions to capture robust attitudes. A data analyst at the University of California at Berkeley has found that when asked about immigration in general, opinion is favorable, yet the same respondents may feel threatened by immigration in their immediate surroundings.

It should not be entirely surprising, then, that the losing side in the 2016 election had the more sophisticated and better funded data analysis operation, despite the supposed prowess of Donald Trump’s creepy social media experts at Cambridge Analytica. Politico ran an admiring story on Hillary Clinton’s guru and his formidable “data-based” campaign that some Republicans feared could set them back for multiple election cycles.

Where Angels Fear to Tread

The failure of 2016 and concerns about 2020 do not mean that we should reject data science. They do suggest, though, that with the proliferation of data — and the rise of proprietary data, largely unavailable for study, in the hands of social media oligarchs — we need both more technical sophistication than ever and analysis that probes for variables hidden in plain sight.

As the Harvard statistician Xiao-Li Meng has written in the abstract of an otherwise highly technical paper, “without taking data quality into account, population inferences with Big Data are subject to a Big Data Paradox: the more the data, the surer we fool ourselves.”
