StatSheet: Fishiness in the YouGov longitudinal survey

First things first: Kansas's Republican primary for the U.S. Senate seat there came out as expected, with incumbent Sen. Pat Roberts doing a little worse than polling predicted, although still coming out on top, 7 points ahead of runner-up Milton R. Wolf. So that's done. Kansas's Senate seat therefore remains SAFE REPUBLICAN for the general election in November, despite Democratic nominee Chad Taylor's surprisingly close performance in some polls.

Then there's the recent YouGov longitudinal survey conducted for CBS News and The New York Times, which I described as a "pollgasm"--there's just a ton of data in here. Importantly, I call it a "longitudinal survey" because its ambitious intent--as if polling a panel of 100,000 registered voters nationwide weren't ambitious enough--is to track changes in opinion from that exact panel every month until the election. Since I participated in the panel as a voter from New Jersey, for example, my understanding is that I'll be getting an email in a few weeks asking me the exact same questions--whether I'll be voting for Sen. Cory Booker or Republican challenger Jeff Bell; whether I'll vote for the Democrat or the Republican running for Congress in my district (Aimee Belgard and Tom MacArthur, respectively, although YouGov doesn't know that); whether I approve or disapprove of President Obama's job performance, etc. And again in September, and presumably in October, too.

It's ambitious, but like many things that try to soar, it falls flat. I found some of the results to be rather dubious, to say the least. Here's one that caught my eye: in Alaska, Sen. Mark Begich apparently leads Republican front-runner Daniel S. Sullivan by 12 points, 49-37. That alone should raise eyebrows, but it's in comparison to other results that this becomes really weird: the same survey showed Sen. Booker ahead by only 7 points in deep-blue New Jersey. Taking margins of error into account, it's entirely possible that Sen. Begich and Sen. Booker actually have identical leads (the two results are compatible within a 95% confidence interval), but if you've given a passing glance to the maps on the sidebar of this blog, or if you know anything at all about red states and blue states, you'll know that Sen. Begich is supposed to be in a lot more trouble than Sen. Booker is. So what gives?
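To put a rough number on that margin-of-error caveat, here is a minimal sketch of the uncertainty on the Alaska lead. It assumes a simple random sample of the 409 unweighted likely voters listed in the Alaska crosstabs and uses the textbook formula for the variance of a difference between two shares from the same sample. YouGov's online panel is not a simple random sample, so treat the result as a floor on the uncertainty rather than the survey's official margin. The function name is made up for illustration.

```python
import math

def lead_margin_of_error(p_a, p_b, n, z=1.96):
    """Approximate margin of error on the lead (p_a - p_b).

    Assumes both shares come from one simple random sample of size n
    (an assumption; the YouGov panel is not sampled this way).
    """
    # Variance of a difference of two multinomial proportions.
    variance = (p_a + p_b - (p_a - p_b) ** 2) / n
    return z * math.sqrt(variance)

# Begich 49%, Sullivan 37%, 409 unweighted likely voters.
moe = lead_margin_of_error(0.49, 0.37, 409)
print(f"Begich lead: 12 points, give or take {100 * moe:.1f}")
```

Under those assumptions the 12-point lead carries an uncertainty of roughly 9 points either way, which is wide enough to overlap with a 7-point lead elsewhere even before accounting for the other state's own sampling error.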

I'm fairly certain the answer is in the demographic weights--or lack thereof, as the crosstabs would suggest. By party ID in Alaska, for example, the unweighted crosstabs add up to pretty much the result CBS and the Times reported: 50-36 if you're only including likely voters, with the one-point differences from 49-37 down to rounding errors.

CBS / New York Times / YouGov poll for Senate election in Alaska crosstab by party ID.

As an aside, it's entirely possible that Sen. Begich's lead is a spurious one that will evaporate after the Republican primary--Republicans look much more ambivalent about Sullivan than Democrats do about Sen. Begich, which is common during heated primaries. But I digress.

The aforementioned rounding error is the reason the unweighted N is 409 likely voters even though 434 - 26 "won't vote" responses = 408. Anyway, the fact that the unweighted numbers reproduce the YouGov result is very bad news for Sen. Begich, because the demographics of the respondents aren't at all representative of the Alaskan electorate. The respondent pool for this survey is only a little bit more Republican than it is Democratic--but exit polling from previous elections and party registration statistics suggest that Republicans should outnumber Democrats by about 2:1 in a midterm year. (Very crudely) extrapolating from 2008 exit polling (that is, generally decreasing the proportion of Democrats and increasing the proportion of Republicans to account for differences between a 2008 wave and a sixth-year-itch midterm), I came up with some closer-to-reasonable results:

CBS / New York Times / YouGov poll for Senate election in Alaska crosstab reweighted.

The weighted numbers at the bottom assume the Democrat:Independent:Republican ratio will be approximately 19:42:39, slightly more Republican than 2008's 21:42:37, which is to be expected. And it seems to work: a Begich lead of 44-42 is much closer to what we've been seeing throughout the year. In fact, most of the crosstabs I went through seem to have oversampled Democrats and self-identified liberals--although some of them oversampled self-identified conservatives as well (vastly undersampling moderates), so not all of the results seem to have overstated Democratic chances.
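For readers who want to see the mechanics, this kind of reweighting is just a weighted average of each party-ID group's candidate preferences. The sketch below is a minimal illustration, not the exact calculation behind the table: the 19:42:39 and 21:42:37 ratios come from the paragraph above, but the per-group support figures in `HYPOTHETICAL_SUPPORT` are placeholder values invented for the example, not the poll's actual crosstabs, so the printed numbers should not be read as results.

```python
def reweight(support_by_group, target_shares):
    """Return each candidate's overall share after reweighting by party ID.

    support_by_group: {group: {candidate: percent support within that group}}
    target_shares:    {group: assumed share of the electorate (any scale)}
    """
    total = sum(target_shares.values())
    overall = {}
    for group, share in target_shares.items():
        for candidate, pct in support_by_group[group].items():
            overall[candidate] = overall.get(candidate, 0.0) + pct * share / total
    return overall

# PLACEHOLDER numbers for illustration only; NOT the YouGov crosstabs.
HYPOTHETICAL_SUPPORT = {
    "Democrat":    {"Begich": 90.0, "Sullivan": 5.0},
    "Independent": {"Begich": 50.0, "Sullivan": 30.0},
    "Republican":  {"Begich": 15.0, "Sullivan": 75.0},
}

# Party-ID ratios quoted in the post (Democrat:Independent:Republican).
midterm_guess = {"Democrat": 19, "Independent": 42, "Republican": 39}
exit_poll_2008 = {"Democrat": 21, "Independent": 42, "Republican": 37}

for label, shares in [("2008 mix", exit_poll_2008), ("midterm guess", midterm_guess)]:
    result = reweight(HYPOTHETICAL_SUPPORT, shares)
    print(label, {name: round(pct, 1) for name, pct in result.items()})
```

Shifting just two points of the electorate from Democrats to Republicans moves the margin noticeably, which is why the choice of target ratio matters so much and why a back-of-the-envelope reweighting like this should not be mistaken for a corrected poll.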

Please remember that this is a very crude approximation; I'm only using it to make a point. I have no intention of "unskewing" these polls to put them into my averages. I'm a lot of things, but unlike noted statistical illiterate Dean Chambers at Unskewed Polls (hilarious and worth a read if you can make it past the eye-watering graphic design), I'm not nearly arrogant enough to presume that I can go into the survey's demographics, add and multiply a few numbers with a bit of hand-waving magic, and produce results that favor whatever my political viewpoint is. So unless the second iteration of this study includes better demographic re-weightings, I'm not including its results in my averages.

There is another conclusion we could draw from this--the crosstabs are bad. This isn't entirely far-fetched: the crosstabs for race and ethnicity seem very incomplete to me, since they only include whites, blacks, and Hispanics. Notably missing are the Asian/Pacific Islander and Native American ethnic groups. In a state like, say, Mississippi, where almost no Asian-Americans or Native Americans voted in the 2008 presidential election, this omission is acceptable and understandable from the point of view of printing crosstabs on a piece of paper--ink is expensive. But the terrible thing is that none of the crosstabs account for Asian-Americans and Native Americans. In states like Alaska, where a full tenth of the electorate identified as aboriginal in the 2008 election, or Hawaii, where 3 in 10 voters were Asian-American or Pacific Islander in the same election, this omission is incredibly glaring. It isn't even a possibility that YouGov simply didn't reach Asians and Pacific Islanders--I happen to know that at least one Asian in the state of New Jersey, where a non-negligible 3% of the 2012 electorate was Asian, participated in the survey. And yet according to the crosstabs, whites, blacks, and Hispanics make up the entire sample pool in every survey they did.

There are just too many things that seem off about this survey for me to ignore. I'm not too concerned, though--I'll just leave the CBS / New York Times / YouGov surveys out of my sampling pool when I do my averages.

Thursday, August 7, 2014
