I have presented most or all of this data before, but I thought it would be a good time for a quick refresher. A lot of people like to talk about the Vegas spread and how its purpose is just to get equal amounts of money on both sides. This is certainly true. However, there are a lot of very interesting mathematical and statistical facts about the spread that are also true and that can provide insight. I found a cache of data on the internet that contains historical spread and game-result data back to 2001. This data set includes close to 12,000 games. Here are a few interesting facts:
1) "Vegas always knows" on average
Based on my analysis, Vegas is the best predictor of the outcome of a given game. If you plot the final spread vs. the average margin of victory, you get a very high correlation and a slope of 1.00:
2) However, the actual margin of victory vs. the spread has a ton of variance
If I plot just the results from 2017, the correlation is very weak and the scatter is tremendous:
Also notice all those data points below the x-axis. Those are upsets, which account for roughly 25% of all games in a given season, every season. So, if one plots the standard deviation instead of just the average margin of victory, you get this:
The standard deviation is 14-15 points, which if you think about it, is huge. That is like saying, "Team X is favored to beat Team Y by 5 points, ± 2 touchdowns." Also, this deviation from the spread is essentially normally distributed, as shown here:
So, another way to think about this is that roughly 2/3 of all games will have a margin of victory within ±2 TDs of the spread. The crazy thing is that this implies a full 1/3 of games will have a margin of victory more than 2 TDs from the spread. AND, for about 5% of games, you can expect the spread to be off by more than 4 TDs, in either direction! Considering there are 50-60 games a week, you can expect to see this 2-3 times a week.
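Those fractions follow directly from the normal distribution. Here is a minimal sketch of the arithmetic, assuming a standard deviation of 14.5 points (the midpoint of the 14-15 range quoted above), 7 points per touchdown, and 55 games in a typical week. The function names are mine, purely for illustration:

```python
import math

SIGMA = 14.5          # assumed: midpoint of the 14-15 point range quoted above
POINTS_PER_TD = 7     # assumed: a touchdown plus the extra point
GAMES_PER_WEEK = 55   # assumed: midpoint of the 50-60 games per week quoted above


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fraction_within(points, sigma=SIGMA):
    """Fraction of games whose margin lands within +/- `points` of the spread."""
    z = points / sigma
    return normal_cdf(z) - normal_cdf(-z)


within_2td = fraction_within(2 * POINTS_PER_TD)
beyond_4td = 1.0 - fraction_within(4 * POINTS_PER_TD)

print(f"within 2 TDs of the spread:    {within_2td:.1%}")
print(f"more than 4 TDs off:           {beyond_4td:.1%}")
print(f"expected 4+ TD misses per week: {beyond_4td * GAMES_PER_WEEK:.1f}")
```

Under those assumptions the first number comes out close to 2/3 and the second close to 5%, which is where the "2-3 times a week" figure comes from.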
3) The probability of victory is well correlated to the spread
If we take all this data together, we can also plot the odds that the favored team will win any given contest, based on the spread:
The trend line is derived from the data shown above: for each spread, I assume a normally distributed actual result with a standard deviation of about 14 (14.112, to be exact, gave the minimum average deviation). So, I only have one fitting parameter, which is nice. Despite having close to 12,000 data points, there is still some scatter, but if I use a 7-point boxcar smoothing function, it looks like this:
You can see how well the trend line fits the data.
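For anyone who wants to reproduce the trend line, here is a rough sketch of the two pieces described above: the one-parameter normal model and a 7-point boxcar (moving average) smoother. The function names are hypothetical, and the way the smoother shrinks its window at the ends of the data is my assumption, not necessarily how the chart was made:

```python
import math

SIGMA = 14.112  # the single fitted parameter quoted above


def favorite_win_probability(spread, sigma=SIGMA):
    """P(actual margin > 0) if the margin is Normal(mean=spread, sd=sigma)."""
    return 0.5 * (1.0 + math.erf(spread / (sigma * math.sqrt(2.0))))


def boxcar_smooth(values, width=7):
    """Centered moving average; the window shrinks near the edges (assumption)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Model curve at a few common spreads
for spread in (3, 7, 10, 14, 21, 28):
    print(f"favored by {spread:>2}: {favorite_win_probability(spread):.1%}")
```

To compare against real data, you would compute the observed favorite win rate at each spread, pass that list through `boxcar_smooth`, and plot it next to `favorite_win_probability`.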
Now, there are still several very interesting questions. Does Vegas adjust the lines based on the known betting habits of certain fan bases? They almost certainly do, but I have never been able to detect any clear bias in the data. Also, it would certainly be very easy to do that for just a handful of games, and that data would just get swamped by all the other data. So, I just ignore this possibility. If I can't measure it systematically, I don't care about it.
So, if a team is favored by 10 points, this translates to a ~75% chance of victory. Does this mean that if those two teams were to play 100 times, one team would win (roughly) 75 times and the other 25 times? OR, does it simply mean that across all games with a 10-point spread, the favorite will win 75% of the time, with an average margin of 10 points in the favorite's direction? I think that the second statement is clearly the correct one.
The first statement is a fascinating concept in itself, as I think it is easy to fall into the trap of thinking that (for example) since MSU beat UofM last year, the 2017 MSU team would beat the 2017 UofM team 100% of the time if they played again. That is certainly not true. But, what I think the Vegas line does (in effect if not in intent) is to estimate this likelihood, based on all the information available at the time. By the end of the season, I think it is pretty likely that they get close to this.
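The second reading of a 10-point spread is easy to check with a quick simulation. This is only a sketch: it assumes the normal model with a standard deviation of 14.112 described earlier, and the function name, game count, and random seed are arbitrary choices of mine:

```python
import random


def simulate_games(spread, sigma=14.112, n_games=100_000, seed=0):
    """Draw final margins (favorite minus underdog) from Normal(spread, sigma)."""
    rng = random.Random(seed)
    margins = [rng.gauss(spread, sigma) for _ in range(n_games)]
    win_rate = sum(m > 0 for m in margins) / n_games   # favorite wins outright
    mean_margin = sum(margins) / n_games               # average over ALL games
    return win_rate, mean_margin


win_rate, mean_margin = simulate_games(10)
print(f"favorite win rate: {win_rate:.1%}, average margin: {mean_margin:.1f}")
```

Note that the average margin here is taken over every simulated game, upsets included, which is what the slope of 1.00 in the first chart is measuring.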
For reference, here is the probability of victory data in tabular form:
Finally, I have one new piece of data to share, which is the likelihood of a given big upset, per year. The table above shows that once the spread gets over ~28 points, the odds of the underdog winning drop to under 2%. But it is sometimes hard to grasp how likely or unlikely such an event actually is, especially since most of the contests in a given year have a spread that is within 1-2 TDs. If you factor in the number of games typical in a given year with a given spread, though, you can create a kind of cumulative distribution function of the number of expected upsets observed in a given year, as a function of the spread. The chart is shown here:
An upset when the spread is above 14 is a once-a-week type of occurrence (not shown). Once the spread gets over 25, we enter the realm of a "once in a season" event. An upset at a spread of 30 or higher is a once-every-3-years event, and the wait goes up quickly from there. The "10-year storm" is a spread of 33.5, and the "50-year storm" is a spread of 37.5, which just so happens to have been the opening spread for the biggest upset on record, Stanford's 2007 upset of USC.
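The calculation behind that chart can be sketched as follows. To be loud about it: the `games_per_spread` counts below are made-up placeholders, NOT the real distribution from my data set, so the printed numbers will not match the chart. The function names are hypothetical as well. The point is only to show the bookkeeping: multiply the number of games at each spread by the upset probability, then add up everything at or above a threshold.

```python
import math

SIGMA = 14.112  # fitted standard deviation quoted earlier


def favorite_win_probability(spread, sigma=SIGMA):
    """P(favorite wins) under the normal model."""
    return 0.5 * (1.0 + math.erf(spread / (sigma * math.sqrt(2.0))))


def expected_upsets_at_or_above(threshold, games_per_spread, sigma=SIGMA):
    """Expected upsets per year among games with spread >= threshold."""
    return sum(
        n_games * (1.0 - favorite_win_probability(spread, sigma))
        for spread, n_games in games_per_spread.items()
        if spread >= threshold
    )


def years_between_upsets(threshold, games_per_spread, sigma=SIGMA):
    """Average wait, in years, for one upset at or above the threshold."""
    expected = expected_upsets_at_or_above(threshold, games_per_spread, sigma)
    return math.inf if expected == 0 else 1.0 / expected


# HYPOTHETICAL games per year at each spread (placeholder values only)
games_per_spread = {3: 120, 7: 110, 10: 90, 14: 70, 21: 45, 28: 20, 35: 8, 42: 2}

for threshold in (14, 21, 28, 35):
    wait = years_between_upsets(threshold, games_per_spread)
    print(f"spread >= {threshold}: one upset every {wait:.2f} years")
```

Swapping in real per-spread game counts is the only change needed to regenerate the "storm" thresholds.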
That is all for now, enjoy the games this weekend!