Tuesday, April 1, 2014

Why We Haven't Done All that Much with the House [WARNING: MATH AHEAD]

It's recommended that liberal arts majors turn back at this point. Or skip to the very bottom.

Most of the stuff you'll see on here is about the Senate. We've got a real nice map of our current forecast on the right-hand sidebar with nice colors, and there's a nice long summary of analyses of the 36 Senate races this year (also with nice colors).

So, you might wonder, what about the 435 House races?

There are a couple of reasons we haven't gone into as much detail about House races as we have with Senate ones. First, the poll-to-race ratio is a lot lower than it is for Senate races. We're still a good seven months off from the election, and the most competitive Senate races already have considerable polling by respected firms. In the House, by contrast, there are about 40 competitive (Toss-Up or Leans toward one party) races as reported by Larry Sabato's Crystal Ball via 270towin.com. The number of races is large, congressmen are less well known than senators, and the races are almost always more localized than Senate races, so the audience for polls of individual House races is a lot smaller. It's simply not worth the money to conduct those polls.

There's also a technical limitation to polling individual House districts: in order to obtain a large random sample for the survey, pollsters use what's known as random digit dialing to select which phones to call. The link elaborates specifically on Pew Research's methodology, but in short, a computer generates a random seven-digit number, which is combined with an area code to form a phone number; that number is dialed, and whoever answers is given the survey. This works fine in Senate races, since area codes never cross state lines: if you dial an 856 number, you are guaranteed to call a phone in New Jersey (or a cellphone with a plan purchased in New Jersey) and not Pennsylvania or Delaware. However, congressional districts aren't so considerate and don't line up as nicely as state lines do. Going with the 856 example, my home district, NJ-3, is covered by the 856 area code. But so is the 2nd district, and about half of the 1st district. If we wanted to poll voters in NJ-3 by random digit dialing phone numbers with area code 856, we'd have a pretty substantial chance of including someone in NJ-2 or NJ-1 in the poll. We could get around this by, say, asking respondents their zip code and filtering out the ones who don't live in NJ-3, but we'd need to ask a lot of people before we could get a sizable sample. It could possibly end up costing more than a given Senate poll that doesn't have this hurdle.
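To make the cost problem concrete, here is a minimal Python sketch of the dialing step and the filtering arithmetic described above. It is a simplification, not any pollster's actual procedure: real phone numbers have restrictions on valid exchanges that this ignores, and the 40% in-district share and the 500-person target are hypothetical numbers chosen purely for illustration.

```python
import random


def random_phone_number(area_code, rng):
    """Combine an area code with a random seven-digit number.

    Simplified: ignores real-world rules about which exchanges are valid.
    """
    seven_digits = rng.randrange(10_000_000)  # 0 through 9,999,999
    exchange = seven_digits // 10_000         # first three digits
    line = seven_digits % 10_000              # last four digits
    return f"{area_code}-{exchange:03d}-{line:04d}"


def expected_respondents_needed(target_sample, in_district_share):
    """Average number of respondents we must reach so that, after
    filtering out people who live outside the district, about
    `target_sample` remain."""
    if not 0 < in_district_share <= 1:
        raise ValueError("in_district_share must be in (0, 1]")
    return target_sample / in_district_share


rng = random.Random(2014)  # seeded only so the example is repeatable
print(random_phone_number("856", rng))

# HYPOTHETICAL share: suppose 40% of people reached via 856 live in NJ-3.
print(expected_respondents_needed(500, 0.40))
```

Under that made-up 40% share, a 500-person district sample takes about 2.5 times as many completed screening questions as a statewide poll of the same size would.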

Second, the House races are less well-defined. Possibly due to the number of races, each Senate race is viewed separately, while we often discuss the House in terms of all House elections taken together. While pollsters haven't been polling individual House races for the reasons above, they have been polling the generic ballot--"if the election were to be held today, would you vote for the Republican candidate or the Democratic candidate in your district?"--for decades now. And this can be helpful. 

We've created a model of House elections, based on the 17 midterm elections since Gallup began polling the generic ballot in 1946, that skips over polling of individual races entirely. What the model tries to calculate is what we're calling the "swing": the percent change in the number of House seats held by the party that doesn't hold the presidency. For example, if the Republicans controlled 200 House seats and the model predicted a swing of 10%, Republicans would be predicted to gain 10% of 200 seats, or 20 seats. Here's the summary regression table we found:

*Correction: This should say 16 observations. We omitted the 1958 election from the data, for reasons discussed below.

The variables are as follows: "Generic" is the president's party's standing in the last Gallup generic House poll before the election; "Approval" is the sitting president's approval rating at the time of the election, or as close as there exists data (data from mid- to late October usually suffices); "RepublicanP" is a dummy variable that takes on a value of 1 if the president is a Republican and takes on a value of 0 if the president is a Democrat. According to the model, all of these variables are negatively correlated with the number of seats the opposition party will pick up at a statistically significant 0.05 level, and the model fits the data we have extremely well (R² = 0.8950).

What this means is that given values of those three variables, we can predict the number of seats that will change hands in November. Currently, for example, President Obama's approval rating according to Gallup is 45%, and Rasmussen's generic ballot shows a one-point lead for Democrats, which we'll say comes out to a 51-49 lead by November (this is all for example's sake, so don't worry about the accuracy of the numbers). We plug 'n chug the values into the model:

Swing = 86.27344 - 1.100497*Generic - 0.401977*Approval - 5.904074*RepublicanP
= 86.27344 - 1.100497(51) - 0.401977(45) - 5.904074(0)
= 12.06

This indicates that if the election were to be held today, Republicans would see a 12.06% increase in their number of House seats. This is equal to a gain of 0.1206*233 = 28.1, so with those numbers the model predicts that if the election were held today, Republicans would pick up around 28 seats. Oh dear. (Read on before telling us how crazy we are.)
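For anyone who wants to try other inputs, the plug 'n chug above can be sketched in a few lines of Python. The coefficients are the ones from the regression equation in this post; the function and variable names are just our own labels for this illustration.

```python
# Coefficients copied from the regression equation in the post.
INTERCEPT = 86.27344
B_GENERIC = -1.100497
B_APPROVAL = -0.401977
B_REPUBLICAN_P = -5.904074


def predicted_swing(generic, approval, republican_president):
    """Predicted percent change in the opposition party's House seats.

    generic: president's party's share in the generic ballot (e.g. 51)
    approval: president's approval rating in percent (e.g. 45)
    republican_president: True if the president is a Republican
    """
    republican_p = 1 if republican_president else 0
    return (INTERCEPT
            + B_GENERIC * generic
            + B_APPROVAL * approval
            + B_REPUBLICAN_P * republican_p)


def predicted_seat_gain(swing_pct, opposition_seats):
    """Convert a swing percentage into a number of seats gained."""
    return swing_pct / 100 * opposition_seats


swing = predicted_swing(generic=51, approval=45, republican_president=False)
gain = predicted_seat_gain(swing, opposition_seats=233)
print(round(swing, 2), round(gain, 1))  # 12.06 28.1
```

These are point estimates only; the function says nothing about the uncertainty around them.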

Things to beware
Any model as simple as this one proposed to describe events as complex as congressional elections should be taken with a grain of salt (okay, maybe several grains of salt). First, the sample size used to generate this model was really, really small. 16 midterm elections is not a lot of data. And when we say "significant", it really just means we know the direction of the correlation. We can't pin down exactly how strong the correlation is. Here are the 95% confidence intervals for the above regression:



So, for example, the effect of the party of the president could be enormous (10 points more favorable toward Republican presidents) or it could be pretty small (1.6 points more favorable toward Republican presidents). The same goes for the other variables.

Second, it's often difficult to tell which variables we should put in: there's a trade-off between including fewer variables while possibly missing some variables or interactions that could be significant and including more variables while diluting the significance of the truly important variables. Other variables we tried including were congressional approval rating, unemployment, second-quarter GDP growth, the president's disapproval rating, and whether the opposition party was also the minority party, none of which seemed to be as significant as the variables we have in the model.

Third, there's the possibility of an overspecialized model--essentially, one that contorts and stretches to accommodate as much of the data as possible. The problem with a model that does this is that not all of the data is representative of a trend; sometimes you just have weird outliers that can't be explained by conventional wisdom or historical trends. We actually attempt to avert this overspecialization problem in this model by omitting the 1958 election from the data. 

The tip-off was that the fit improved tremendously when the 1958 election was removed from the data used to generate the model. Removing a data point on the basis of "it didn't fit well" alone is bad science and bad statistics: theory must exist to back up what goes into the model and what gets thrown out. We have theory for why we've included and excluded certain things:
  • The generic ballot. This is pretty intuitive; the generic ballot is a gauge simply of which party is more popular. The more popular party will tend to win more seats; that's how representative democracy is supposed to work. Since swing is described in terms of the percentage gained by the opposition and the generic ballot value is the number for the president's party, we expected a negative correlation (which bore out).
  • The president's approval rating. Midterm elections, more so than presidential elections, tend to be a referendum on the president. It's the voters' midpoint opportunity to show the president what they think of him before throwing him out or giving him a second chance. If more voters approve of the president's performance, they'll tend to help his party. We expected a negative correlation here as well, which also bore out.
  • The party of the president. This affects the generic ballot question. Historically, the generic ballot has tended to overestimate Democratic chances: a tied generic ballot will mean Republicans end up picking up a few seats and winning the national popular vote by a few percentage points. This is a natural result of the demographics of midterm elections--the electorate, compared to that of presidential elections, skews older, whiter, and more male, all of which helps Republican candidates. We expect that the generic ballot will be biased against the opposition if they're Democrats and the president is a Republican; therefore, we expected a negative correlation here as well (which also bore out).
  • Throwing out 1958. The best theory I have for the weirdness of 1958 is a culmination of historical events, not trends, which is why I feel comfortable not including it in the data. The model vastly underestimates the Democratic pickups in 1958, when, despite President Eisenhower's approval ratings sitting above 50%, the generic ballot showed a 14-point Democratic lead, which materialized in the form of 49 extra seats that November. We have two basic ideas of what caused such a large-scale repudiation of congressional Republicans in 1958: 1) a deep, worldwide recession that began that year and 2) a perceived tarnishing of the U.S. reputation abroad, symbolized first by the Soviet launch of Sputnik in October 1957, which seemed to demonstrate that the U.S. was falling behind in the Cold War; and second by a cold reaction from Latin American countries to U.S. involvement in the Western Hemisphere (watch protesters mob Vice President Nixon's motorcade in Caracas, Venezuela in May 1958).

What this means for the actual elections
As calculated above, if the election were held today the model would predict Republican gains of around 28 seats. While we don't know what those seats are, we do know that the results in many House races are not completely independent events. A "wave" election where one party performs exceptionally well nationally--such as 2006 for the Democrats or 2010 for the Republicans--tends to help all of its candidates: the old "rising tide lifts all boats" adage. Therefore, we believe we'll be able to get a pretty good guess at which seats will change hands, just by flipping the seats from most marginal to least. We'll be developing that list of seats later on. 
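The "flip from most marginal to least" idea can be sketched as a simple sort. This is only an illustration of the approach, not the list we'll eventually publish: the district names and margins below are hypothetical placeholders, and we're assuming "marginal" means the incumbent party's previous winning margin in percentage points, which is just one reasonable way to define it.

```python
def seats_to_flip(defending_seats, predicted_gain):
    """Pick the seats most likely to change hands under a uniform wave.

    defending_seats: list of (district_name, margin) pairs for seats held
        by the president's party, where margin is that party's previous
        winning margin in points (an assumed definition of "marginal").
    predicted_gain: number of seats the model expects the opposition to gain.
    """
    n = max(0, round(predicted_gain))
    # Smallest margin first: the most marginal seats flip first.
    ranked = sorted(defending_seats, key=lambda seat: seat[1])
    return [name for name, _margin in ranked[:n]]


# HYPOTHETICAL districts and margins, for illustration only.
example_seats = [
    ("District A", 8.5),
    ("District B", 1.2),
    ("District C", 3.9),
    ("District D", 15.0),
]

print(seats_to_flip(example_seats, predicted_gain=2))
# ['District B', 'District C']
```

This treats the wave as perfectly uniform, which real elections never are, so it is best read as a first guess rather than a seat-by-seat forecast.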

More specifically for the 2014 elections, it should be a cold shower for any Democratic donors who hope to see the Democrats regain their majority after November. Our point estimate--with a lot of variability--is that Republicans will pick up somewhere in the neighborhood of 20 to 30 seats. There are only approximately 40 competitive seats up for grabs (the Crystal Ball count mentioned earlier), so we're pretty sure the real number is not nearly that high (we think it'll be closer to, I don't know, 10). But our point still stands: To win the House back, Democrats have to pick up 19 seats. Right now, I'd say that's not happening any time soon.
