(Originally posted to various internet sources in July of 2015)
Math + sports = happiness; at least that is how I look at the world. A little while ago, I posted a very long analysis of the upcoming college football season based on a spreadsheet-driven simulation of the entire football season. In that article, I glossed over the details of the simulation, but for those who are interested, I have prepared this companion article that goes into the simulation in greater detail. If nothing else, I think there is some cool data about Vegas spreads in here.
Ironically, the spreadsheet was born on the campus of Purdue University in 1998, where as a graduate student I first created a file to track schedules and scores. Year by year, I made steady improvements to the format to make the entry of both the schedule and scores easier. At some point, I added an equation to calculate a rudimentary power ranking based on the points scored in each game and the opponent's winning percentage. A few years later, I made a personal breakthrough when I figured out how to make the equation recursive such that I could optimize the "power index" of each team. Later I modified the schedule data to account for home field advantage. A few years after that, I realized that I could run the equation in reverse to predict the score of future games. When I looked at the impact of home vs. neutral games, I found that, using a full season of data, my spreadsheet predicted that home field was worth about 3 points, which matches the general rule of thumb for Vegas odds. At this point, I knew that I was on to something interesting, so I started tracking opening Vegas lines for each game as well. As the data sets grew larger, I even injected some probability into the mix and devised a way to predict the probability of victory for each team in each game. This has finally allowed me to estimate the chance that each team will win X number of games total.
First, I want to explain a little bit of how I calculate my power index for each team. As I mentioned, I only track the score of each game played by all 128 Division 1 teams as well as the full schedule of each team (including the location of each game). The basic concept is essentially the idea that if Team A beats Team B by 10 points and Team A beats Team C by 15 points, then Team B should beat Team C by 5 points. This principle is applied to every game in the season and the power indices are regressed to minimize total error. However, I do add a bit of a twist: I don't actually use raw points. I use the ratio of the point differential to the total points scored in each game. So, if Team A beats Team B by the score of 42-14, Team A's power index for that game is +0.500 (28/(42+14)) higher than Team B's average power index (prior to any correction for the home field advantage of either team). Due to the ratio concept, running up the score has less of an impact and defense is generally rewarded. A team's total power index is then just the sum of the power index earned in each game. Because each team's power index is based in part on each of their opponents' power indices, a recursive formula is needed to "optimize" the power index of each team in order to minimize the total error. Early in the season, there is not a strong enough "connectedness" between teams to allow the data to converge, so it is necessary to "bias" the power index of each team to a fixed value, which I set using the preseason ranking of each team as found in various preseason publications.
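As a rough illustration of the idea (not the actual spreadsheet formulas), here is a minimal Python sketch. It assumes each team's index is repeatedly reset to the average of "opponent's index plus margin ratio" over its games, which is one simple way to drive down the total squared error. The function names, the `prior` and `prior_weight` parameters, and the number of sweeps are all hypothetical, and home field advantage is ignored.

```python
def margin_ratio(points_for, points_against):
    """Point differential divided by total points scored in the game."""
    total = points_for + points_against
    return 0.0 if total == 0 else (points_for - points_against) / total


def fit_power_indices(games, prior=None, prior_weight=0.0, sweeps=200):
    """Iteratively fit one power index per team.

    games: list of (team_a, team_b, points_a, points_b) tuples.
    prior: optional dict of preseason index values (an assumption here),
           pulled toward with strength prior_weight to "bias" early-season fits.
    """
    prior = prior or {}
    # For each team, store (opponent, margin ratio from this team's view).
    results = {}
    for team_a, team_b, pts_a, pts_b in games:
        m = margin_ratio(pts_a, pts_b)
        results.setdefault(team_a, []).append((team_b, m))
        results.setdefault(team_b, []).append((team_a, -m))

    index = {team: prior.get(team, 0.0) for team in results}
    for _ in range(sweeps):
        for team, played in results.items():
            # Each game "votes" for this team's index: opponent's index + margin.
            total = sum(index[opp] + m for opp, m in played)
            weight = len(played)
            if prior_weight > 0 and team in prior:
                total += prior_weight * prior[team]
                weight += prior_weight
            index[team] = total / weight
    return index


games = [("A", "B", 30, 20), ("A", "C", 26, 14)]
index = fit_power_indices(games)
print(index["A"] - index["B"], index["B"] - index["C"])
```

In this toy example Team A beats B by a ratio of 0.2 and C by a ratio of 0.3, so the fitted indices place B about 0.1 above C even though B and C never played, which is the transitive logic described above.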
Second, I wanted to say a few words about Vegas. In short, what they do there is nothing short of amazing. My database of opening Vegas spreads goes back to 2009 and now has almost 4000 entries. Over that span, there have been 3834 games with a non-zero point spread. 66 of those games were pushes (the outcome matched the opening spread exactly), and 1883 of those games saw the favored team beating the spread. Excluding the pushes, that percentage is 49.973%, a difference of exactly ONE game out of over 3700 from a dead 50-50 split between the favored team beating the spread or not. As I said: amazing. My head tells me that the spread is set to get an equal amount of cash on each side of the line, but that stat just blows my mind. Vegas is very, very good at setting opening lines in the right place. (Incidentally, I always track the opening spread because it is a fixed number that is consistent and not influenced by bettors, who I assume are unreliable, or at the very least inconsistent.)
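The arithmetic behind that percentage is easy to check. The short Python snippet below uses only the three counts quoted above; the variable names are just for illustration.

```python
games_with_spread = 3834   # games with a non-zero opening spread
pushes = 66                # outcome matched the spread exactly
favorite_covered = 1883    # favored team beat the spread

decided = games_with_spread - pushes          # games that were not pushes
cover_rate = favorite_covered / decided       # favorite's rate against the spread
games_from_even = decided / 2 - favorite_covered

print(f"{cover_rate:.3%}")   # prints 49.973%
print(games_from_even)       # prints 1.0
```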
That being said, the Vegas lines don't actually do a great job of predicting which team actually will win each game. Over the span of my dataset, the Vegas spread only gets the winner of a game correct about 75% of the time (75.5%, to be exact). Sports (college sports especially) in reality are unpredictable. But, as I collected more and more data on Vegas spreads, I realized that I could use the data to see how often a team would win if they were favored by a certain number of points. The data from 2009 to 2014 looks like this:
The data look like you would expect. As the spread approaches 0, the chance of the favored team winning approaches 50%. As the spread grows larger, the percentage of victory for the favored team rises to 100%. When I first saw this data, I added the trend line shown in the plot, where the relationship between the spread and victory percentage is linear up to a spread of around 22, at which point the chance of victory is roughly 100%. But there is still a lot of scatter in the data, and it is tough to know the actual trend. So I decided to boxcar-average the data to smooth it out a bit. When I did this, the trend became a bit more obvious:
Instead of a line, the data has more of a bend to it and now asymptotically approaches 100% at a slightly higher spread of around 28-30 points (I simply used a quadratic equation here, adjusted so that it reaches 100% at a spread of 30 and is 50% at a spread of 0). This makes sense based on the actual data as well. When I look at the raw W/L data for each spread "bin", there are no more than two consecutive bins where the favored team is undefeated up to a spread of 28 points. However, once the spread reaches 28.5 points or higher, there are over 200 games total since 2009, and an upset occurred only ONCE. That game was played in Week 1 and involved a new Div 1 team, Texas State, beating Houston by 17 as a 38-point underdog (which is a bit of an unusual circumstance). The take-home here is that there is a very simple correlation that relates the opening spread to the historical probability of victory, and once a team is favored by over 28 points, their odds of winning approach 100%.
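For anyone who wants to play with this curve, here is one quadratic that satisfies the two constraints mentioned above (50% at a spread of 0, 100% at a spread of 30). The exact coefficients of the fitted curve are not given in this article, so the form below is an assumption: it places the top of the parabola at 30 points so the curve flattens out as it reaches 100%. The function name and the `saturation` parameter are hypothetical.

```python
def favorite_win_probability(spread, saturation=30.0):
    """Estimated chance that the favorite wins, given the size of the spread.

    Assumed form: p = 1 - 0.5 * (1 - s / saturation)^2, capped at 100%
    for spreads at or beyond the saturation point.
    """
    s = min(abs(spread), saturation)
    return 1.0 - 0.5 * (1.0 - s / saturation) ** 2


for spread in (0, 3, 7, 15, 30):
    print(spread, round(favorite_win_probability(spread), 3))
```

Under this assumed form, a 15-point favorite comes out at 87.5%, and anything at or above 30 points is treated as a sure thing.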
Getting back to my spreadsheet, once I started to accumulate several seasons' worth of data, I came to a bit of a harsh reality: my algorithm didn’t seem to do any better than Vegas does at actually predicting the winner of any given contest. In fact, I do a little worse at 71.3%. But, at some point it dawned on me that perhaps I could use my algorithm to actually predict the Vegas spread, and as it turns out, the algorithm does do a fair job at this. The complete correlation between my predicted spread and the opening Vegas spread for the 2009-2014 data set looks like this:
Again, there is some scatter, but it looks a lot better than the correlation between even the Vegas spread and the actual outcomes, which looks like this:
In general, I am pretty happy with this performance. With just this simple algorithm including only score data adjusted by the location (home, away, neutral, semi-home, or semi-away), I can predict the opening Vegas Spread within a point 10% of the time, within 3 points 30% of the time, within 7 points 60% of the time, and within 10 points 75% of the time. The full distribution is shown here:
Once I started to combine and correlate my algorithm’s predictions, the Vegas Spread data, and the actual results of games, I discovered that I could do a lot of fun things. The obvious first question to ask is whether my algorithm can suggest which team to bet on for each game. As you might guess, I don’t think that I can beat the house with it (if I did, I wouldn’t tell you, now would I?) But, it does fairly well. Since 2009, with around 3700 games, I am currently 50.6% against the spread. It does vary a bit from year to year. In both 2012 and 2013, I was almost at 54% ATS, but last year was a little rough and I was only at 47.5% ATS. So, it is good enough to entertain myself and utilize for the occasional office Bowl Pool.
The second question I asked was whether or not I could use the algorithm to pick upsets. If Vegas picked one team to win and my spreadsheet picked the other team, who is correct? Once again, Vegas does do a little better than I do picking winners, but I have found historically that my spreadsheet will correctly pick an upset winner 40.6% of the time. This may sound better than it actually is, because the majority of these upset picks occur for games where the spread is 2.5 points or less (where the odds of the favorite winning are close to 50% anyway), but my calculations suggest that I do a little better than random chance here. If nothing else, it again makes for good fun. This year, my plan is to share my upset picks on Twitter. Hopefully the algorithm won’t make me look bad. Stay tuned.
The final question I asked myself is whether I could use all of this data and spreadsheet infrastructure to predict or simulate the upcoming season. Over the last few years, I have used my free time during the summer months to create the new spreadsheet for the year and use it to analyze the upcoming season. My earlier post explains my findings for this year in much greater detail. As for the simulation itself, the tricky part is that some sort of input data is needed to seed the spreadsheet with how good each of the 128 teams is going to be. Now, there is no mathematically rigorous way to do this, in my opinion, so I have historically simply pulled the preseason rankings from the various preseason magazines (Phil Steele, Athlon, ESPN, Lindy’s) and averaged the results to get a "consensus" preseason ranking. That part is easy, and even though it will certainly not be accurate at the end of the year, it is the best place to start. But the hard part is to assign a power index value to each rank. (My power indices tend to scale from roughly 1.5 to 2.5 depending on the year, and the rankings are of course from 1 to 128.) The obvious thing to do would be to use the data from past years and use a simple correlation. In general, I believe that this is the "correct" way to go. I can apply this historical correlation to generate each team’s power index, and the spreadsheet will project a point spread for each game played by each team. I can even use the Vegas spread correlation to convert the spread to a probability of victory. If I sum up the probabilities for each individual game for a given team, I can generate an "expected value" of conference or overall wins (which should relate to the team’s over/under for the year). A few years ago, I did even more math and figured out a simple way to calculate the probability that each team will win a given number of games total (i.e. the probability to win all 12 games, or 11 games, or 10… and so on).
In the season analysis, I refer to this simulation as my "probabilistic simulation."
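To make the last two steps concrete, here is a small Python sketch of how per-game win probabilities can be turned into an expected win total and a full distribution of win totals. This is a standard approach and not necessarily the formula used in the spreadsheet: it assumes the games are independent and builds the distribution one game at a time. The function names and the example probabilities are made up for illustration.

```python
def expected_wins(win_probs):
    """Expected number of wins: the sum of the per-game win probabilities."""
    return sum(win_probs)


def win_total_distribution(win_probs):
    """Return a list where entry k is the probability of exactly k wins.

    Assumes each game is independent of the others.
    """
    dist = [1.0]  # before any games: 100% chance of zero wins
    for p in win_probs:
        nxt = [0.0] * (len(dist) + 1)
        for wins, prob in enumerate(dist):
            nxt[wins] += prob * (1.0 - p)   # lose this game
            nxt[wins + 1] += prob * p       # win this game
        dist = nxt
    return dist


# Hypothetical 4-game slate for one team.
probs = [0.9, 0.75, 0.6, 0.5]
print(expected_wins(probs))
print(win_total_distribution(probs))
```

The last entry of the returned list is the chance of running the table, and the entries always sum to 1.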
With this data in hand, it is quite easy to simulate the results for all games and spit out standings by assuming that any team with a greater than 50% chance of winning a game is going to win, by the exact number of points projected by the simulated point spread. Basically, this is assuming that the (predicted) Vegas spread is the actual result of each game. However, as I have already pointed out, Vegas can only predict the correct winner about 75% of the time. So, when I run the simulation using the historical correlation of rank to my power index, the results are way too conservative. Basically, the favored team in each league blows through everybody, except on the occasion where they have to go on the road to play a slightly lower ranked team. In this case, the home field advantage is enough for the home team to pull out the victory. Actually, I have developed a correlation for this as well (rank differential based on post-season data vs. chances of winning either while at home or on the road). The general conclusion is that if a "better" team has to go on the road to play a team that is only 5-10 slots below them in the rankings (if we assume it is a true ranking of team strength on a neutral field), they are more likely to lose than to win. In contrast, when a higher ranked team plays at home against a team ranked 10 slots below them, the chance of winning is around 75%. The full data are shown here:
But, I digress. In general, the "correct" simulation predicts that far too many teams will remain undefeated and makes no prediction that is actually that interesting. However, I have found that if I adjust my correlation of power index to ranking in a way that artificially decreases the separation between teams, the resulting simulation gives results that are a somewhat better match for what you would typically see in an actual season. Essentially, what I do is adjust the level of parity within college football. Using this model, I still assume that a high ranked team playing a lower ranked team at home will always win, which we know does not match reality. But the majority of ranking upsets (60% by my count) do occur on the road. This method allows me to run the simulation of the season several times using different levels of parity to see how the outcomes vary. In general, I have found that a moderate amount of parity produces a result that matches reasonably well with reality (for example, maybe 1-2 teams will go undefeated instead of 5-6, and roughly 20% of the games result in ranking upsets on the road). In my season analysis, I refer to this as my "base" or "most trustworthy" simulation. If nothing else, this methodology is a self-consistent way to evaluate the role that each team’s schedule plays in the final results of the college football season.
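One simple way to picture the parity adjustment is sketched below in Python. The actual adjustment is made to the rank-to-power-index correlation and is not spelled out here, so this is only an assumed stand-in: it shrinks every team's index toward the league average by a `parity` factor and then picks each game's winner deterministically. The `home_edge` value (expressed in power index units), the team names, and the numbers are all hypothetical, and neutral-site games are not modeled.

```python
def apply_parity(power_index, parity):
    """Shrink each index toward the mean.

    parity = 0 leaves the indices unchanged; parity = 1 makes every team equal.
    """
    mean = sum(power_index.values()) / len(power_index)
    return {team: mean + (1.0 - parity) * (value - mean)
            for team, value in power_index.items()}


def pick_winner(home, away, power_index, home_edge):
    """Deterministic pick: the home team wins if its index plus the edge is higher."""
    return home if power_index[home] + home_edge >= power_index[away] else away


index = {"A": 2.5, "B": 2.0, "C": 1.5}
compressed = apply_parity(index, parity=0.5)

# Team A (ranked higher) travels to Team B.
print(pick_winner("B", "A", index, home_edge=0.3))        # prints A
print(pick_winner("B", "A", compressed, home_edge=0.3))   # prints B
```

With the original spacing, the higher ranked road team still wins; once the indices are compressed, home field is enough to flip the result, which is the kind of road upset the parity adjustment is meant to produce.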
(As an added bonus, I show here the full table of expected values for all 128 teams. Enjoy!)