
Sports-Math Study Hall: The Spread vs. Probability of Victory

A few years back, I developed a sports-math fascination that quickly turned into a bit of an obsession. The topic? The relationship between the Vegas spread and the probability that the favored team will be victorious. I am not sure exactly where this came from, but its origin is probably linked to ESPN’s FPI metric, which attempts to do some very similar things to what I try to do, but with seemingly much more dubious methodology (such as an undying reliance on the importance of recruiting rankings over on-field results). In any event, I wanted to find the answer to the question of how spreads correlate to the probability of victory. I have tried a few Google searches to find out if someone else has written on the subject, and so far I have not come up with much at all. But, I think I have found that answer, or at least I think I am very close.

At first glance, you would think that this question should be quite simple to solve. After all, you just need to plot the data and fit a line to it, right? Well, as it turns out, spread vs. victory data is pretty scattered and you need quite a bit of data to make sense of it. I have been logging spread data since 2009, and the set now includes over 4000 data points. I have not yet added the 2017 data, but for those 8 full years of college football data, this is what the raw data looks like:


To a first approximation, it looks like it might be a linear-ish line that reaches 100% somewhere in the mid-20s. But, this is not a terribly satisfying result. So, I decided to see what would happen if I tried to smooth the raw spread data using a 7-point boxcar / moving average method. If I do that, the data now looks like this:
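For anyone who wants to try the smoothing step outside of a spreadsheet, here is a minimal Python sketch of a centered 7-point boxcar average. The function name `boxcar_smooth`, the shrinking window at the two ends of the list, and the sample numbers are all assumptions made for illustration; they are not the actual data or necessarily the exact edge handling used for the plot.

```python
def boxcar_smooth(values, window=7):
    """Centered moving average; the window shrinks near the ends of the list."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up win fractions for consecutive spread values (illustration only).
raw = [0.50, 0.58, 0.49, 0.63, 0.60, 0.71, 0.66, 0.78, 0.70, 0.85]
smooth = boxcar_smooth(raw)
print([round(v, 3) for v in smooth])
```

Each smoothed point is just the average of itself and up to three neighbors on either side, which is why a single upset at a sparsely populated spread gets diluted rather than showing up as a spike.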


To me, this plot makes a bit more sense, as it has the shape that one would generally expect. That is, when the spread is a pick'em (zero) the probability of victory is naturally 50% and the curve approaches 100% asymptotically as the spread increases.  It does get a little wonky as the spread increases, however, mainly because there is generally less data in that region and a single upset can cause a visible spike, as is clear from the raw data.

In an attempt to fit this data, I decided to use a simple quadratic equation, also shown above. In this case, setting the parameters was pretty straightforward, as the function should be 0.50 at x=0 and I somewhat arbitrarily set the probability of victory to reach 100% at a spread equal to 30. This is based on the general observation in the data that upsets do tend to happen up until the spread reaches 30. After that, they are very rare (in the 8 yr span, there are only 2 upsets out of almost 250 games with spreads this large). I found that a very simple equation fits the data reasonably well.  It is:

where x is the opening Vegas spread. Once the spread exceeds 30, I just set the probability of victory at 100% which is not true, but was "close enough at the time." I used this formula in my various mathematical calculations for several years. 

But, it was not very satisfying. I felt like there should be a more mathematical answer to this question rather than just a pretty-looking empirical formula. Then, earlier this year, I was having a related discussion online when someone gave me a hint that allowed me to derive a solution to this problem. That hint was the observation that teams that are favored by x points will win by an average of... x points. This seems somewhat obvious, but I had never actually checked it. So, the first thing I did was to plot this data, which looks like this:


As you can see, based on 8 years of data, the statement above is true. There is obviously more scatter in the data once the spread gets above that magical value of 30, but once again, that is due to increasingly sparse data at the higher spreads. But, this got me thinking, if I can take the average value of the margin of victory at each value of the spread, I can perform other mathematical / statistical operations. The general problem in the raw data is that there is just not enough of it to see the "real" correlation easily. But, what if I assume that the distribution of game outcomes fits a Gaussian / Normal Distribution which is centered on the spread? As we will see in a moment, if this is true, it is relatively simple to calculate the function that had eluded me for several years. But, the first question is: is the data at each value of the spread normally distributed?

As it turns out, there is still not enough data even after 8 years to see a true distribution at any given value of the spread. But, what you can do is to pool all of the data together and plot the distribution of the difference between the final margin of victory and the opening spread, and see what it looks like. That data is shown here, including a best fit to a Normal Distribution curve:


While it is not a perfect fit for a bell / Gaussian curve, it is pretty close. Furthermore, this plot gives us another useful piece of information, which is the standard deviation of the margin of victory.  When all the data across all spreads is taken into account, the standard deviation is 15.84, just slightly over 2 TDs.  
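As a rough illustration of how that standard deviation can be computed, here is a short Python sketch. It pools the difference between the final margin and the opening spread across all games and takes the mean and standard deviation. The `games` list is invented for the example, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; with thousands of games the sample version gives almost the same number.

```python
import statistics

# Hypothetical (opening spread, final margin) pairs, both from the
# favorite's point of view. A negative margin means the favorite lost.
games = [(3.0, 10), (7.5, -3), (14.0, 21), (21.0, 17), (1.5, -7), (10.0, 35)]


def residual_stats(games):
    """Mean and standard deviation of (final margin - opening spread)."""
    residuals = [margin - spread for spread, margin in games]
    return statistics.mean(residuals), statistics.pstdev(residuals)


mean_err, sd_err = residual_stats(games)
print(f"mean error: {mean_err:.2f}, standard deviation: {sd_err:.2f}")
```

If the spread really is an unbiased estimate of the final margin, the mean error should sit near zero, and the standard deviation is the single number that feeds the rest of the calculation.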

From a pure football standpoint, this information in itself is pretty interesting. It speaks to the overall variance / chaos that is inherent to college football. Based on the math of the normal distribution, 68% of the population should fall within one standard deviation of the mean (in this case, the opening spread). That means that Vegas only gets within roughly 2 TDs of the actual margin of victory about two-thirds of the time. That is extraordinary. I think that most football fans would intuitively guess that the typical miss is less than a TD, but it is not. Chaos rules in reality.
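To put numbers on that, here is a small Python check of how often the final margin should land within a given number of points of the spread. It assumes the error really is normal with a standard deviation of 15.84; the function name `fraction_within` is just a label chosen for this sketch.

```python
import math


def fraction_within(points, sigma=15.84):
    """P(|final margin - spread| < points) for a normal error with sd sigma."""
    return math.erf(points / (sigma * math.sqrt(2.0)))


for points in (7, 14, 15.84):
    print(points, round(fraction_within(points), 3))
```

Under those assumptions, the spread is within a single TD of the final margin only about a third of the time.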

Getting back to the main point, it turns out that this same math behind the normal distribution allows us to solve the initial puzzle using basic statistics. This is because every statistical distribution has an associated cumulative distribution function that tells you what fraction of the population falls above or below any value. For the normal distribution, all you need is the mean (the opening spread) and the standard deviation (around 15.84), and you can calculate the probability of the population (in this case, the final point differential) being above or below a fixed number. Since we care about winning and losing, this number is simply zero. Excel even has a simple formula "NORMDIST" that can be used. Literally all you need to generate this curve is to make a column in Excel with the possible spreads (usually 1 to however high you want to go in increments of 0.5) and in a 2nd column use the formula 1-NORMDIST(0,spread,15.84,true), where "spread" is the value in the "spread" column.
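The same calculation can be sketched in Python using only the standard library. This is meant as an equivalent of the Excel formula above, not a different method: the function name `win_probability` is my own label, and the list of spreads is arbitrary.

```python
import math


def win_probability(spread, sigma=15.84):
    """P(final margin > 0) when the margin is Normal(mean=spread, sd=sigma).

    Equivalent to Excel's 1 - NORMDIST(0, spread, sigma, TRUE).
    """
    return 0.5 * (1.0 + math.erf(spread / (sigma * math.sqrt(2.0))))


for spread in (0, 3, 7, 14, 21, 30):
    print(f"{spread:>4}  {win_probability(spread):.3f}")
```

A pick'em comes out at exactly 50%, and a team favored by one full standard deviation (15.84 points) comes out at about 84%, which is the usual one-sigma value for a normal distribution.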

Unfortunately, it is not quite that simple. The problem is the standard deviation. The assumption in the above paragraph is that standard deviation is fixed for all values of the spread. In reality, this does not seem to be quite true. If you actually plot the standard deviation as a function of the opening spread, you get this:


So, while the standard deviation does hover over 15 in cases where the spread is small, it generally trends down (again, with significant scatter).  So, one could simply use the linear regression in the formula above for the effective standard deviation, which would seem to make the most sense.
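Here is a hedged Python sketch of that idea: fit a straight line to the standard deviation at each spread, then plug the spread-dependent value into the normal calculation. The numbers in `spreads` and `sds` are invented to mimic a gentle downward trend; they are not the real regression inputs, and `fit_line` is an ordinary least-squares helper written for this example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def win_probability(spread, sigma):
    """P(final margin > 0) for a Normal(mean=spread, sd=sigma) margin."""
    return 0.5 * (1.0 + math.erf(spread / (sigma * math.sqrt(2.0))))


# Made-up standard deviations at a handful of spreads (illustration only).
spreads = [0, 5, 10, 15, 20, 25, 30]
sds = [16.2, 15.9, 15.1, 15.4, 14.6, 14.0, 14.3]
slope, intercept = fit_line(spreads, sds)


def sigma_at(spread):
    """Effective standard deviation from the fitted line."""
    return intercept + slope * spread


p_fixed = win_probability(20, 15.84)
p_linear = win_probability(20, sigma_at(20))
print(f"fixed sigma: {p_fixed:.3f}, linear sigma: {p_linear:.3f}")
```

Because a smaller standard deviation concentrates outcomes closer to the spread, a downward-sloping line nudges the favorite's win probability up at larger spreads.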

But, as it turns out, both the fixed value and the linear fit above give curves that do not quite fit the data. In both cases, the curve under-predicts the probability of the favored team winning, see below:


So, with some amount of hesitation, I decided to use my box-car smoothed data to perform a regression on the standard deviation data to find an "optimized" line. That optimized fit to the standard deviation line is shown here in red.
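The exact fitting procedure is not spelled out here, so the following Python sketch is only one plausible way to do it. It assumes the standard deviation is a line with a fixed intercept of 15.84 and tunes the slope as the single free parameter, picking the value that minimizes the squared error against the smoothed win fractions. Both that choice of parameter and the synthetic `observed` values (generated from a known slope so the search has something to recover) are assumptions for illustration.

```python
import math


def win_probability(spread, sigma):
    """P(final margin > 0) for a Normal(mean=spread, sd=sigma) margin."""
    return 0.5 * (1.0 + math.erf(spread / (sigma * math.sqrt(2.0))))


def sum_squared_error(slope, intercept, spreads, observed):
    """Squared error between the model curve and the observed win fractions."""
    total = 0.0
    for spread, obs in zip(spreads, observed):
        sigma = intercept + slope * spread
        total += (win_probability(spread, sigma) - obs) ** 2
    return total


def best_slope(intercept, spreads, observed, candidates):
    """Grid search: return the candidate slope with the smallest error."""
    return min(
        candidates,
        key=lambda m: sum_squared_error(m, intercept, spreads, observed),
    )


spreads = [float(s) for s in range(1, 31)]

# Synthetic stand-in for the smoothed data, built from a known slope.
TRUE_SLOPE = -0.10
observed = [win_probability(s, 15.84 + TRUE_SLOPE * s) for s in spreads]

candidates = [i / 100.0 for i in range(-30, 1)]  # -0.30 ... 0.00
fitted = best_slope(15.84, spreads, observed, candidates)
print(f"fitted slope: {fitted:.2f}")
```

With real data the error never reaches zero, but the idea is the same: the standard deviation line is judged by how well the resulting win-probability curve matches the smoothed data, not by how well it matches the standard deviation points themselves.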


Clearly, this line doesn't fit the standard deviation data as well, but if you use this line and generate the probability of victory curve, you get this:


I think this fits the data pretty damn well. Yes, I did have to use one adjustable parameter instead of zero, but I am pretty happy with it. It also has the benefit that the probability of victory goes to 99% when a team is favored by 31.5 points, which seems in line with reality.  The fact that I had to use an empirical fit on the standard deviation data is a bit weird, and I am not sure why it is needed, but it might suggest the population distribution is not exactly Gaussian, or something about the variance of the population versus the variation of my (still very large) sample. I did not make an exhaustive search of other distributions, so there may be a better one out there. As I accumulate more data, I might be able to figure that out.
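As a quick sanity check on that 99% figure, the normal model can be run backwards to see what standard deviation it implies at a spread of 31.5. This Python sketch uses `statistics.NormalDist` from the standard library; the function name `implied_sigma` is my own.

```python
from statistics import NormalDist


def implied_sigma(spread, win_prob):
    """Standard deviation that gives P(margin > 0) = win_prob at this spread."""
    z = NormalDist().inv_cdf(win_prob)
    return spread / z


sigma_at_31_5 = implied_sigma(31.5, 0.99)
print(f"implied standard deviation at 31.5: {sigma_at_31_5:.2f}")
```

That works out to roughly 13.5 points, a couple of points below the pooled value of 15.84, which is consistent with the standard deviation trending down as the spread grows.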

Quite honestly, I am not sure if this is the method used by ESPN or Nate Silver, etc. When they show probabilities of victory, their numbers are similar to mine but not quite the same. Regardless, I do believe that this is the correct methodology, even if the exactly correct value for the standard deviation is still a bit unclear. 

That is all for now. Thus ends the lesson. Enjoy!



