We’re screwed, and we have a lot of deep issues to think and argue about. Most are so hard to resolve that we are likely to slip into the sorts of mutual recriminations that have plagued many recent exchanges here. The question of how to read polling data, however, is sufficiently well defined, with fast enough real-world feedback, that we can use it to draw some conclusions. And then maybe we can think about extending those conclusions to the more contentious questions of what to do next.
There were many poll aggregators and other election predictors this season. These included Drew Linzer for DKos, Nate Silver for 538, Nate Cohn for the NYT Upshot, Sam Wang for the Princeton Election Consortium, a collection of bookies and gamblers from around the world, Predictwise, PollyVote,… The two most discussed here were Silver and Wang, who conveniently represent pretty much the opposite ends of the prediction spectrum. There were many differences between their methods, as I shall discuss, but the biggest was that they had radically different estimates of the uncertainty in the predictions. Silver thought the uncertainty was large; Wang thought it was small. So Silver gave about a 30% chance of Trump winning, and Wang gave less than 1%. Needless to say, Silver was more realistic. I’ll go over some of the technicalities, but that’s not the point of this story. The point is to encourage introspection on the part of the dozens of DKos bloggers and commenters (and Huffpo writers, etc.) who were highly critical of Silver and his defenders, and in the process revealed some mental habits that may be harmful in the very hard times ahead.
The most persistent theme of the commenters here was to explain in detail the motives for each change or lack of change in the 538 odds. Everything was treated as if it were a decision made for a reason. Some results were supposed to be just “clickbait”, others reflected Silver’s corporate conservative streak, some his embarrassment over having written (contrary to his quantitative model) that Trump was unlikely to get the nomination, and so on. The idea that the whole output was determined by a preset algorithm seemed incomprehensible to most Kossacks, no matter how often several of us pointed it out. So all that psychoanalysis was devoted to the output of a long-standing computer algorithm! In effect, it was a Rorschach test, in which people revealed the sorts of things that they themselves would do with data. (Technical note: Silver actually made a couple of small adjustments. One was to include the possibility of McMullin winning UT, an unforeseen and irrelevant complication. The other was to include a new low reliability rating for the huge new state-by-state dumps of wildly fluctuating results from several online surveys, such as Google’s.)
Silver’s algorithm had a number of distinct features. These included grades for pollsters, in which the weight given to their polls depended on past accuracy, and house-effect adjustments, by which pollsters with strong R or D leans would be adjusted to compensate. Both of these features tend to reduce jitter in the results, producing less clickbait-worthy news than simpler aggregators (e.g. Huffpo) find. A trend-line adjustment was also included, in which older polls were adjusted to reflect any trends consistently found since they were taken in national polls, other polls of the same state, or polls of states with similar demographics. Again, all the parameters for this adjustment were put in at the start, having been tuned to do a good job on results from past years. This is just a common-sense way of making the aggregate responsive to changes without introducing too much extra noise. When HRC was going down in the polls, people here thought that the trend-line adjustment was a subjective thumb on the scale and should be eliminated. When she was going up, they thought it wasn’t responsive enough. When R-leaning polls were adjusted toward D, they didn’t notice. When D-leaning polls were adjusted toward R, they protested the thumb on the scale.
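To make the mechanics concrete, here is a minimal sketch of that kind of weighted aggregate with house-effect and trend-line adjustments. This is my illustration, not 538’s actual code; the field names, weights, and numbers are all hypothetical.

```python
# Toy poll aggregate: weight by pollster grade, subtract each pollster's
# known lean (house effect), and shift stale polls along the recent trend.
# All parameters are illustrative, not 538's.

def aggregate(polls, trend_per_day=0.0):
    """polls: list of dicts with keys
         margin       -- D-minus-R margin in points,
         weight       -- pollster grade (higher = more trusted),
         house_effect -- pollster's typical lean in points,
         age_days     -- how old the poll is.
       trend_per_day: recent movement in points per day."""
    num = den = 0.0
    for p in polls:
        # Remove the pollster's typical lean...
        adjusted = p["margin"] - p["house_effect"]
        # ...and pull older polls toward the current trend line.
        adjusted += trend_per_day * p["age_days"]
        num += p["weight"] * adjusted
        den += p["weight"]
    return num / den

# Two equally weighted polls, one leaning D by 2, one leaning R by 2:
# after the house-effect correction both say the same thing.
polls = [
    {"margin": 5.0, "weight": 1.0, "house_effect": 2.0, "age_days": 0},
    {"margin": 1.0, "weight": 1.0, "house_effect": -2.0, "age_days": 0},
]
print(aggregate(polls))  # both adjust to 3.0
```

The point of the design is visible even in the toy: correcting for a pollster’s lean keeps a flood of R- or D-leaning polls from dragging the average, which is exactly why the adjustments looked like a “thumb on the scale” only when they moved against the reader’s candidate.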
Wang’s model was extremely popular here. One reason given was that it was simpler, which it was. Other reasons given were that it was more stable (actually the expected margin was about as changeable as Silver’s), that it involved no subjective changes (actually, there was a huge subjective change midstream, in which the expected variability was cut by a large fraction), that it didn’t use t-distributions (but Wang said it did), and so on. In other words, people were just making things up to justify a preference that had an entirely different motive. That motive wasn’t subtle: Wang’s model said early on that the probability of an HRC win was extremely high, and kept that prediction (getting to 99+%!) right up to the election. It sounded good, so people invented traits the model didn’t have, involving statistical features they didn’t understand, to give objective-sounding reasons for believing it.
One of the most common sayings around here was that regardless of the national polls, all that mattered was state results, because of the electoral system. The electoral vote was supposed to be secure regardless of the polling changes. The problem was that in Silver’s simulations it was far more common for HRC to win the popular vote and lose the election than the other way around. The sayings here were not only not based on data, they directly contradicted the only serious attempt to see what the data indicated.
OK, that’s enough for now of the technical details about unimportant differences between the models, and how people here misunderstood them and used them as excuses not to pay attention to Silver. What about the difference that mattered: the radically different estimates of the uncertainty? Silver’s justification for large uncertainty was very simple: the national polling aggregate was off by about 3% (D) in the 2014 Senate elections, about 3% (R) in the 2012 Presidential election, and almost 3% (R) in the 2000 Presidential election. Due to non-response bias, polling is getting harder, not easier. Pollsters have to guess which groups will actually turn up and vote, and that’s not easy. So errors like that are not rare. The national error this year was about 2.3% (D). (1.5%: see update) Beyond those national errors, there can be even larger errors in groups of similar states, e.g. the rust belt. So Silver specifically warned that if NH started to look weak, that could be the first sign that states with lots of old white folks were weakening, and that the rust belt might not hold. He said that effect had a good chance to swing the Electoral College, but less chance to swing the national popular vote.
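Why correlated error dominates is easy to demonstrate with a toy Monte Carlo (my illustration, not Silver’s model): give every state a shared national polling error plus independent per-state noise, and the shared component alone decides whether a uniform lead survives, because it can knock down many states at once.

```python
# Toy demonstration of correlated polling error. A candidate leads by
# `lead` points in each of n_states identical states; errors have a
# shared (national) component with sd corr_sd and independent per-state
# noise with sd state_sd. All numbers are illustrative.
import random

def win_prob(lead, corr_sd, state_sd, n_states=10, trials=20000, seed=1):
    """Estimated probability of carrying a majority of the states."""
    random.seed(seed)
    wins = 0
    for _ in range(trials):
        shared = random.gauss(0.0, corr_sd)  # one error hits all states
        carried = sum(
            1 for _ in range(n_states)
            if lead + shared + random.gauss(0.0, state_sd) > 0
        )
        wins += carried > n_states // 2
    return wins / trials

# Same 3-point lead, same per-state noise; only the correlated error
# changes. A 3-point shared error leaves a real chance the whole map
# tips at once, while a 0.5-point one makes the lead look nearly safe.
print(win_prob(3.0, 3.0, 2.0))
print(win_prob(3.0, 0.5, 2.0))
```

With many states, independent errors mostly cancel across the map; a shared error does not, which is why a historically plausible ~3-point national miss mattered so much more than any per-poll noise.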
Almost no one here listened. People who paid attention and understood the reasoning were called “concern trolls”, “Debbie Downers”, “bed-wetters”, etc. We heard calls to expand the campaign beyond AZ to MO, GA, even TX and SC. We heard that thanks to minority voters and/or women a loss was impossible.
What about Wang’s very low estimate of the inter-state correlated uncertainty, the estimate so many serious-sounding people here described as much sounder and more objective? I’ll let him have the last word on that:
I did not correctly estimate the size of the correlated error – by a factor of five. As I wrote before, that five-fold difference accounted for the difference between the 99% probability here and the lower probabilities at other sites. We all estimated the Clinton win at being probable, but I was most extreme.
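The arithmetic behind that admission is simple. Under a normal error model, the win probability is Φ(margin/σ), so underestimating σ several-fold turns a modest lead into apparent near-certainty. The margin and σ values below are illustrative, not Wang’s actual inputs.

```python
# How a too-small error estimate manufactures certainty.
# Win probability under a Normal(0, sigma) polling-error model.
from math import erf, sqrt

def normal_win_prob(margin, sigma):
    """P(true margin > 0) when the polling error is Normal(0, sigma)."""
    return 0.5 * (1.0 + erf(margin / (sigma * sqrt(2.0))))

# A 2.5-point lead with a 0.5-point error sd looks like a lock;
# the same lead with the error sd five times larger does not.
print(normal_win_prob(2.5, 0.5))
print(normal_win_prob(2.5, 2.5))
```

The first case sits above 99.9%; the second is in the low-to-mid 80s. That single factor in σ is the whole distance between “99+%” and the far more cautious numbers at other sites.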
So what are the lessons to draw? The technical ones are hard to summarize quickly, especially since we can see that some very smart and well-educated people got them wrong. The less technical ones are more interesting here.
The same absolute refusal to consider dissenting views, warning signs, etc. that was provably mistaken on the numerical issue showed up on all other issues related to the election: what candidate to choose, how to approach voters, who to write off, who to try to persuade, etc. The same abuse, playground taunts, appeals to group identity, etc. That stuff ain’t productive. It’s scary, though not as scary as a Republican Supreme Court, a speed-up of global warming, and kicking 20,000,000 people off health insurance, including one of my sons.
Early in the primary season, the same group-think mentality inhibited any realistic assessment of vulnerabilities. If we were unable to even listen to each other, we were even more unable to listen to others. HRC was the sure-fire candidate as the most popular woman on the planet, and besides it was so unfair that she was strongly disliked by half the voters.
Are Hillary supporters the only culprits? Hardly; there were some very unrealistic election memes circulating among Bernie supporters later in the primaries. As for the polls consistently showing that Bernie would do 5% or more better than HRC against Trump, we will simply never know whether any of that would have held up to the rigors of a real and dirty campaign.
Would it have mattered if DKos readers and similar types had been more realistic? Maybe- we wouldn’t have lulled people in swing states into thinking it was in the bag. It seems (based on the leaked emails, etc.) that the HRC campaign was more tough-minded than DKos readers, but still evidently prone to some of the same lack of realism. They didn’t go back to MI until hitting a bit of panic at the end, and never went to WI.
There are going to be extreme challenges in the years ahead. On the most important issue, the global environment, we are likely to mostly fail, with catastrophic results. We will still be the same species, and prone to the same systematic mental errors. But it’s still possible to fight them a bit, to try to train ourselves to be a bit more objective. Then maybe we won’t fail quite as catastrophically.