Sports-Math Study Hall: The Spread vs. Probability of Victory

A few years back, I developed a sports-math fascination that quickly turned into a bit of an obsession. The topic? The relationship between the Vegas spread and the probability that the favored team will be victorious. I am not sure exactly where this came from, but its origin is probably linked to ESPN’s FPI metric, which attempts to do some very similar things to what I try to do, but with seemingly much more dubious methodology (such as an undying reliance on the importance of recruiting rankings over on-field results). In any event, I wanted to find the answer to the question of how spreads correlate to the probability of victory. I have tried a few Google searches to find out if someone else has written on the subject, and so far I have not come up with much at all. But, I think I have found that answer, or at least I think I am very close.

At first glance, you would think that this question should be quite simple to solve. After all, you just need to plot the data and fit a line to it, right? Well, as it turns out, spread vs. victory data is pretty scattered and you need quite a bit of data to make sense of it. I have been logging spread data since 2009, which now includes over 4000 data points. I have not yet added the 2017 data, but for those 8 full years of college football data, this is what the raw data looks like:

To a first approximation, it looks like it might be a roughly straight line that reaches 100% somewhere in the mid-20s. But, this is not a terribly satisfying result. So, I decided to see what would happen if I tried to smooth the raw spread data using a 7-point boxcar / moving average method. If I do that, the data now looks like this:

To me, this plot makes a bit more sense, as it has the shape that one would generally expect. That is, when the spread is a pick'em (zero) the probability of victory is naturally 50% and the curve approaches 100% asymptotically as the spread increases. It does get a little wonky as the spread increases, however, mainly because there is generally less data in that region and a single upset can cause a visible spike, as is clear from the raw data.
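For anyone who wants to try the smoothing step at home, here is a minimal sketch of a 7-point boxcar (centered moving average) in Python. The function name and the way the edges are handled (averaging over whatever neighbors exist) are my assumptions for illustration; the spreadsheet version behind the plot above may treat the ends differently.

```python
def boxcar_smooth(values, window=7):
    """Centered moving average over `window` points.

    Assumption: near the ends of the list, the average is taken over
    whichever neighbors actually exist, so the output has the same
    length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Here `values` would be the raw win fraction at each spread, ordered from the smallest spread to the largest. A wider window gives a smoother curve but blurs real features, which is the usual trade-off with this kind of filter.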

In an attempt to fit this data, I decided to use a simple quadratic equation, also shown above. In this case, setting the parameters was pretty straightforward, as the function should be 0.50 at x=0 and I somewhat arbitrarily set the probability of victory to reach 100% at a spread equal to 30. This is based on the general observation in the data that upsets do tend to happen up until the spread reaches 30. After that, they are very rare (in the 8 yr span, there are only 2 upsets out of almost 250 games with spreads this large). I found that a very simple equation fits the data reasonably well. It is:

where x is the opening Vegas spread. Once the spread exceeds 30, I just set the probability of victory at 100% which is not true, but was "close enough at the time." I used this formula in my various mathematical calculations for several years.

But, it was not very satisfying. I felt like there should be a more mathematical answer to this question rather than just a pretty-looking empirical formula. Then, earlier this year, I was having a related discussion online when someone gave me a hint that allowed me to derive a solution to this problem. That hint was the observation that teams that are favored by x points will win by an average of... x points. This seems somewhat obvious, but I had never actually checked this fact. So, the first thing I did was to plot this data, which looks like this:

As you can see, based on 8 years of data, the statement above is true. There is obviously more scatter in the data once the spread gets above that magical value of 30, but once again, that is due to increasingly sparse data at the higher spreads. But, this got me thinking, if I can take the average value of the margin of victory at each value of the spread, I can perform other mathematical / statistical operations. The general problem in the raw data is that there is just not enough of it to see the "real" correlation easily. But, what if I assume that the distribution of game outcomes fits a Gaussian / Normal Distribution which is centered on the spread? As we will see in a moment, if this is true, it is relatively simple to calculate the function that had eluded me for several years. But, the first question is: is the data at each value of the spread normally distributed?

As it turns out, there is still not enough data even after 8 years to see a true distribution at any given value of the spread. But, what you can do is to take all of the data together and plot the distribution in relative difference between the final margin of victory and the opening spread for all the data and see what it looks like. That data is shown here, including a best fit to a Normal Distribution curve:

While it is not a perfect fit for a bell / Gaussian curve, it is pretty close. Furthermore, this plot gives us another useful piece of information, which is the standard deviation of the margin of victory. When all the data across all spreads is taken into account, the standard deviation is 15.84, just slightly over 2 TDs.
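The calculation behind that number can be sketched in a few lines of Python. This assumes you have two equal-length lists, the opening spreads and the final margins (both from the favorite's point of view); the function name is hypothetical, and I am assuming a population standard deviation here, which for thousands of games is nearly identical to the sample version.

```python
import statistics

def residual_stats(spreads, margins):
    """Mean and standard deviation of (final margin - opening spread).

    Assumption: population standard deviation (pstdev). With several
    thousand games the sample version (stdev) is almost the same.
    """
    residuals = [m - s for s, m in zip(spreads, margins)]
    return statistics.fmean(residuals), statistics.pstdev(residuals)
```

If the spread is an unbiased predictor, the mean should come out close to zero, and the standard deviation is the number that feeds the normal-distribution math.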

From a pure football standpoint, this information in itself is pretty interesting. It speaks to the overall variance / chaos that is inherent to college football. Based on the math of the normal distribution, 68% of the population should fall within one standard deviation of the mean (in this case, the opening spread). That means that Vegas only gets within roughly 2 TDs of the actual margin of victory about two-thirds of the time. That is extraordinary. I think that most football fans would intuitively think this number was less than a TD, but it is not. Chaos rules in reality.
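To put numbers on this, here is a small sketch of how often the final margin should land within a given number of points of the spread, assuming the misses really are normally distributed with a standard deviation of 15.84. The function name is mine, not anything standard.

```python
import math

def prob_within(points, sigma=15.84):
    """P(|final margin - spread| <= points) for a normal distribution
    centered on the spread with standard deviation `sigma`."""
    return math.erf(points / (sigma * math.sqrt(2)))

# One standard deviation, exactly 2 TDs (14 points), and 1 TD (7 points).
for pts in (15.84, 14, 7):
    print(f"within {pts} points: {prob_within(pts):.1%}")
```

Under these assumptions the spread is within one standard deviation about 68% of the time, within exactly 14 points roughly 62% of the time, and within a single TD only about a third of the time.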

Getting back to the main point, it turns out that this same math behind the normal distribution allows us to solve the initial puzzle using basic statistics. This is because every statistical distribution has an associated cumulative distribution function that allows you to calculate the percentage of the population above and below any value. For the normal distribution, all you need is the mean (the opening spread) and the standard deviation (around 15.84), and you can calculate the probability of the population (in this case, the final point differential) being above or below a fixed number. Since we care about winning and losing, this value is simply zero. Excel even has a simple formula, "NORMDIST", that can be used. Literally all you need to generate this curve is to make a column in Excel with the possible spreads (usually 1 to however high you want to go in increments of 0.5) and in a 2nd column use the formula 1-NORMDIST(0,spread,15.84,true), where "spread" is the value in the "spread" column.
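For those who would rather not use Excel, the same calculation can be sketched in Python with the standard library's `statistics.NormalDist`. The function name `win_probability` is just my label for the Excel formula 1-NORMDIST(0,spread,15.84,true), with the fixed standard deviation of 15.84 as the default.

```python
from statistics import NormalDist

def win_probability(spread, sigma=15.84):
    """Probability that the favorite's final margin is above zero,
    assuming margins are normal with mean `spread` and std `sigma`.
    Python equivalent of 1 - NORMDIST(0, spread, sigma, TRUE)."""
    return 1.0 - NormalDist(mu=spread, sigma=sigma).cdf(0.0)

# Build the two "columns": spreads in steps of 0.5 and their probabilities.
spreads = [0.5 * k for k in range(0, 61)]  # 0.0, 0.5, ..., 30.0
table = [(x, win_probability(x)) for x in spreads]
for x, p in table[::10]:
    print(f"spread {x:4.1f}: {p:.3f}")
```

A pick'em comes out at exactly 50%, and the curve climbs toward (but never reaches) 100% as the spread grows.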

Unfortunately, it is not quite that simple. The problem is the standard deviation. The assumption in the above paragraph is that standard deviation is fixed for all values of the spread. In reality, this does not seem to be quite true. If you actually plot the standard deviation as a function of the opening spread, you get this:

So, while the standard deviation does hover over 15 in cases where the spread is small, it generally trends down (again, with significant scatter). So, one could simply use the linear regression in the formula above for the effective standard deviation, which would seem to make the most sense.
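Here is one way the "standard deviation versus spread" step could be sketched: group the games by opening spread, take the standard deviation of (margin minus spread) in each group, and fit an ordinary least-squares line through the result. Both function names, the `min_games` cutoff, and the choice of an unweighted fit are my assumptions for illustration.

```python
from collections import defaultdict
import statistics

def sigma_by_spread(spreads, margins, min_games=2):
    """Std of (margin - spread) at each distinct spread value.
    Spreads with fewer than `min_games` games are skipped (assumption)."""
    groups = defaultdict(list)
    for s, m in zip(spreads, margins):
        groups[s].append(m - s)
    return {s: statistics.pstdev(r)
            for s, r in sorted(groups.items()) if len(r) >= min_games}

def fit_line(xs, ys):
    """Unweighted least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

Calling `fit_line` on the keys and values of `sigma_by_spread(...)` gives the intercept and slope of the trend line. Since the large spreads have far fewer games, a fit weighted by game count might be more defensible, but that is a refinement I have not tried here.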

But, as it turns out, using either the fixed value or the linear fit above gives a curve that does not quite fit the data. In both cases, the curve under-predicts the probability of the favored team winning, see below:

So, with some amount of hesitation, I decided to use my box-car smoothed data to perform a regression on the standard deviation data to find an "optimized" line. That optimized fit to the standard deviation line is shown here in red.
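A rough sketch of what such an "optimized" fit could look like is below. To be clear about the assumptions: I am treating the standard deviation as a straight line in the spread, holding the intercept fixed at 15.84, and scanning the slope over a hypothetical grid to minimize the squared error against the smoothed win fractions. The function names, the grid, and the choice of which parameter to vary are all mine and are not necessarily what was done for the plot.

```python
from statistics import NormalDist

def model_curve(spreads, intercept, slope):
    """Win probability at each spread when sigma = intercept + slope * spread."""
    return [1.0 - NormalDist(mu=x, sigma=intercept + slope * x).cdf(0.0)
            for x in spreads]

def fit_slope(spreads, smoothed_win_frac, intercept=15.84, slopes=None):
    """Grid search for the slope that best matches the smoothed data."""
    if slopes is None:
        # Hypothetical search range: -0.30 to 0.00 in steps of 0.001.
        slopes = [-0.30 + 0.001 * k for k in range(301)]
    best_slope, best_err = None, float("inf")
    for slope in slopes:
        # Skip slopes that would make sigma zero or negative somewhere.
        if any(intercept + slope * x <= 0 for x in spreads):
            continue
        predicted = model_curve(spreads, intercept, slope)
        err = sum((p - o) ** 2 for p, o in zip(predicted, smoothed_win_frac))
        if err < best_err:
            best_slope, best_err = slope, err
    return best_slope
```

A grid search is crude, but with a single adjustable parameter it is easy to check by eye and avoids pulling in an optimizer.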

Clearly, this line doesn't fit the standard deviation data as well, but if you use this line and generate the probability of victory curve, you get this:

I think this fits the data pretty damn well. Yes, I did have to use one adjustable parameter instead of zero, but I am pretty happy with it. It also has the benefit that the probability of victory goes to 99% when a team is favored by 31.5 points, which seems in line with reality. The fact that I had to use an empirical fit on the standard deviation data is a bit weird, and I am not sure why it is needed, but it might suggest the population distribution is not exactly Gaussian, or something about the variance of the population versus the variation of my (still very large) sample. I did not make an exhaustive search of other distributions, so there may be a better one out there. As I accumulate more data, I might be able to figure that out.
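As a sanity check on that 99% figure, the normal model can be run backwards: given a spread and a win probability, what standard deviation does it imply? A minimal sketch, with a hypothetical function name, using the inverse normal CDF from the Python standard library:

```python
from statistics import NormalDist

def implied_sigma(spread, win_prob):
    """Standard deviation for which a favorite by `spread` points wins
    with probability `win_prob`, under the normal model.
    Requires 0.5 < win_prob < 1 and spread > 0."""
    z = NormalDist().inv_cdf(win_prob)  # z-score for that probability
    return spread / z

print(round(implied_sigma(31.5, 0.99), 2))
```

A 99% win probability at a spread of 31.5 corresponds to an effective standard deviation of about 13.5 points, noticeably lower than the 15.84 measured across all games, which is consistent with the downward trend in the standard deviation as the spread grows.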

Quite honestly, I am not sure if this is the method used by ESPN or Nate Silver, etc. When they show probabilities of victory, their numbers are similar to mine but not quite the same. Regardless, I do believe that this is the correct methodology, even if the exact value for the standard deviation is still a bit unclear.

That is all for now. Thus ends the lesson. Enjoy!
