Saturday, February 14, 2015

Know when to fold 'em...


Chances are, if you're running for president, you won't win. Obvious though it may seem, it was still kind of surprising to me to actually read it in James Oliphant's article in The National Journal: "Running for the presidency is, at heart, a loser's game. In the past 35 years, scores of men and women have tried, and only five have made it."

Of the eight luminaries who shared the 2012 primary debate stage--including three governors, a senator, two congressmen, a Speaker of the House, and a pizza guy--a whopping 0% would go on to become President of the United States.

Because of this, not every candidate tries to see primary season through to the end--and why would they? Running a presidential campaign costs cash and credibility (just ask Rick Perry about the latter), and not every candidate has much to spare. Unfortunately, the campaign data alone show very little rhyme or reason as to exactly when or why candidates finally decide to pull out of the race. Maybe unforeseen circumstances, like the hospitalization of Rick Santorum's daughter in 2012, force the candidate to figure out his priorities. Maybe the campaign can't sustain the cash flow needed to keep going. Or maybe the candidate has simply decided it's a losing battle, with an impossibly narrow path to victory.

This question--how do we know when to fold 'em?--came up as I was developing a state-by-state model for eyeballing--and this is really, really, really rough eyeballing--who would win the Republican nomination for president next year. State-by-state polling, especially for early states like Iowa and New Hampshire, or delegate-rich prizes like Florida, is readily available, as is a tentative calendar for primary elections. Both of these should allow us to estimate, day-to-day, how many delegates each candidate has accumulated.
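
If you're curious what that back-of-the-envelope delegate estimate might look like, here's a rough R sketch. The states, poll numbers, and the assumption of purely proportional allocation are all placeholders of my own choosing--real state rules vary, and plenty of states aren't proportional at all.

    # Rough sketch of the day-to-day delegate estimate described above.
    # The proportional-allocation rule and all numbers here are placeholders,
    # not real polling data or actual state allocation rules.
    calendar <- data.frame(State = c("IA", "NH", "FL"),
                           Date = as.Date(c("2016-02-01", "2016-02-09", "2016-03-15")),
                           Delegates = c(30, 23, 99))
    polls <- data.frame(State = c("IA", "NH", "FL"),
                        Share = c(0.25, 0.18, 0.22))   # one candidate's poll share per state

    # Delegates this candidate is projected to have accumulated as of a given date,
    # assuming each state's delegates split proportionally to the polls
    estimate_delegates <- function(as_of) {
      held <- calendar$Date <= as_of
      sum(calendar$Delegates[held] * polls$Share[match(calendar$State[held], polls$State)])
    }
    estimate_delegates(as.Date("2016-02-15"))   # only IA and NH have voted by then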

However, one problem--among others--is that polling is highly fluid. A lot of assumptions go into this ballpark estimate, including (but by no means limited to):
  • Polling now is a good indicator of polling next year. (It almost certainly isn't.)
  • The candidates for whom I do these estimates will run. (This is pretty much guaranteed to not happen.)
  • Electoral performances in different states are independent events. (This is the most egregious one: because the elections are on different dates across the nation, a candidate who can lay claim to some momentum early on can see his chances snowball.)

In fact, the assumptions were so far-fetched that I eventually just gave up all hope of being able to predict the winner of the nomination and decided to focus on this simple phenomenon--when do candidates drop out of the race?

Specifically, I decided to frame the question as "what is the probability that a candidate with certain attributes drops out of the race within a week of a time t?" Framed that way, the best way to determine those probabilities seemed to be a logistic regression on some variables from the 2012 and 2008 Republican primaries for president, with a binary variable "Withdrew" as the dependent variable. These variables included things like estimated number of delegates won to date, number of states won to date, and number of weeks since the Iowa caucus (which was a completely arbitrary starting point). Each observation in the data set was a candidate and a "check-in" date at which to compute the values of the other variables for that candidate--so there would be observations for Mitt Romney on January 3, 2012; January 10, 2012; January 17, 2012; and so on. But there would also be observations for Rick Santorum on those dates, and for every other candidate whose candidacy lasted that long.
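
In case the shape of that data set is hard to picture, here's a toy version in R--every number below is invented, only the column layout matters:

    # Toy version of the candidate/check-in data set described above.
    # Every value here is invented; only the column layout matters.
    checkins <- data.frame(
      Candidate      = c("Romney", "Romney", "Santorum", "Santorum"),
      CheckinDate    = as.Date(c("2012-01-03", "2012-01-10", "2012-01-03", "2012-01-10")),
      DelegateShare  = c(0.45, 0.40, 0.15, 0.25),  # share of pledged delegates won to date
      StatesWon      = c(1, 1, 0, 1),              # states won to date
      WeeksSinceIowa = c(0, 1, 0, 1),              # weeks since the Iowa caucus
      Withdrew       = c(0, 0, 0, 0)               # 1 if the candidate dropped out within a week of this date
    )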

All right, got that out of the way. What does the model predict? Its output is a value from 0 to 1 indicating the probability that a candidate will drop out of the race within a week of a given date, conditional on the values of some other variables. I tried a bunch of combinations of variables in R, and (rather disappointingly) it's tough to find any significant variables, let alone a well-fit model. However, I did find two models that looked promising. The first (creatively named "model 1") is given by:

t = β0 + β1 × DelegateShare + β2 × StateWeeks + ε

where t is plugged into the logistic function to produce a probability. DelegateShare is the proportion of allocated delegates projected to go to the candidate--so if 30 delegates have been pledged, and 15 of those have been pledged to the candidate, then DelegateShare is 0.5. StateWeeks is a crude index designed to account for both the passage of time and the number of states a candidate has won over that time period. The intuition behind it is that as the weeks pass throughout campaign season, a candidate needs to keep winning states in order to stay viable, and if he doesn't keep pace he'll be more and more likely to drop out.
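
Fitting model 1 itself is basically a one-liner with R's glm(). The sketch below assumes a full check-in data frame like the toy one above, with a StateWeeks column added however you choose to compute it and both 0s and 1s present in Withdrew:

    # Sketch of fitting model 1 as a logistic regression with R's glm().
    # Assumes 'checkins' is the full candidate-week data set, with a StateWeeks
    # column and both 0s and 1s in Withdrew (the toy frame above is too small to fit).
    model1 <- glm(Withdrew ~ DelegateShare + StateWeeks,
                  data = checkins, family = binomial(link = "logit"))
    summary(model1)                      # coefficient table like the one below
    predict(model1, type = "response")   # fitted dropout probabilities, between 0 and 1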

Here's the summary of model 1:

Variable         Estimate   Std. Err.        z         P
Intercept         -1.2412      0.4767   -2.604   0.00921
DelegateShare    -10.1780      4.3097   -2.362   0.01819
StateWeeks         0.1521      0.0763    1.994   0.04612

Pseudo-R2: 0.1287

While the meaning of each variable's coefficient is not simply determined (it matters where you are on the logistic curve to begin with), the signs still have intuitive significance. A positive sign means that as the value goes up, the probability of the candidate withdrawing from the race goes up; a negative sign means that as the value goes up, the probability of withdrawing goes down.
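
To put a number on that, here's what model 1 says about a hypothetical candidate, using the coefficients from the table above (the candidate's inputs are made up):

    # Model 1's prediction for a hypothetical candidate with 5% of pledged
    # delegates and a StateWeeks value of 8 (both inputs invented).
    t_val  <- -1.2412 + (-10.1780) * 0.05 + 0.1521 * 8
    p_drop <- 1 / (1 + exp(-t_val))   # the logistic function
    p_drop                            # about 0.37: a ~37% chance of withdrawing within the week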

Just for comparison, here's the equally creatively named "model 2":

t = β0 + β1 × DelegateShare + β2 × StatesWon + ε

DelegateShare means the same thing as it did in model 1; StatesWon is exactly what it says on the tin: the number of states won by a candidate by a given date. Model 2's summary is given below:

Variable         Estimate   Std. Err.        z          P
Intercept         -1.5890      0.4249   -3.730   0.000192
DelegateShare    -20.6662      9.2212   -2.241   0.025016
StatesWon          0.5200      0.2294    2.266   0.023434

Pseudo-R2: 0.1691

Pros and cons
Both models have their merits. Both have significant variables (at P < 0.05); both have fairly poor fit judged by the pseudo-R2, although model 2 is slightly better in that respect. However, I'd be more willing to use model 1 as a predictor than model 2, for the simple reason that the signs on the variables in model 1 make more sense. Model 1 essentially says that as a candidate's share of the delegates goes up, the probability of the candidate dropping out of the race decreases (and by quite a bit, too). At the same time, it says that as time passes without the candidate winning a state, the probability of the candidate dropping out goes up. Both of these are reasonable conclusions we might draw without a statistics package.

Model 2 offers somewhat stranger conclusions. The relationship between dropping out and share of the delegates is still negative, so that's a plus. But what model 2 also suggests is that as a candidate wins more and more states, he becomes more and more likely to drop out of the race. As a result, model 2 wildly overpredicts the probability that the front runners, who have won more states, drop out, and also underpredicts the probability that any candidate drops out very early on. According to model 2, a candidate who wins 10% of the delegates and 6 states is twice as likely to drop out as a candidate who wins the same share of delegates and just 4 states.
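
You can check that last comparison directly against the coefficients in model 2's table:

    # Checking the comparison above with model 2's coefficients.
    p_model2 <- function(delegate_share, states_won) {
      t_val <- -1.5890 + (-20.6662) * delegate_share + 0.5200 * states_won
      1 / (1 + exp(-t_val))
    }
    p_model2(0.10, 6)   # about 0.37
    p_model2(0.10, 4)   # about 0.17 -- roughly half that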

Overall, it's still kinda meh
Although model 1 is better, it's still not great. The first problem is structural and out of my hands--dropping out is inherently a rarer event than staying in, since a candidate can only drop out once, but can stay in as many weeks as he likes. You might say that the upper bound on the number of dropouts is only as high as the number of candidates in the field, while the upper bound on the number of stay-ins (or whatever) is the number of candidates in the field multiplied by the number of weeks they stay in. 

One problem that shouldn't really be a problem is money. The data set I compiled doesn't include fundraising data--which might seem odd, since you'd expect money to be a reason behind many candidates' departures. It turns out, though, that much of the effect money has on whether a candidate stays in or drops out either doesn't help predict withdrawal or is already accounted for by the other variables.

Ultimately, though, we can't really be sure if this model is good or not. It's really just to satisfy my curiosity, but it should be fun to look at--a year from now.
