Around this time of year, the various preseason college
football publications start appearing on the shelves. For some time now, I have
wondered whether there is a good way to evaluate how good or bad the various
preseason rankings really are. This
year, I decided that I would try to figure it out. Now, it would be straightforward to simply
compare the various preseason rankings to the post-season CFB playoff ranking,
AP ranking, or coaches poll. But, that
only tells the story for about a third of all Division 1 teams, and I was looking for
something a bit more comprehensive.
From time to time, I have discussed and posted data based on
an algorithm that I have developed to generate my own power rankings. Since my method does assign a ranking to all
128 Div. 1 teams, is typically a reasonable predictor of Vegas spreads (more on
that later), and since I also tabulate preseason predictions from various
sources to support my annual preseason analysis (coming soon to a message board
near you), it occurred to me that I had all the data that I needed to make this
comparison. So, I went back over the data from the last 10 years or so and
compared various full 128-team preseason rankings (from sources such as Phil
Steele, Athlon's, Lindys, ESPN (FPI), and SP+) and tabulated the average
absolute difference between their rankings and my algorithm’s post-season
rankings for all Division 1 teams. The
results are shown below:
Now, as you can see, I do not have a perfect data set to
work with. I only have rankings from multiple sources for the last 5 years, and you
also must trust that my algorithm is a reasonable approximation of the relative
strength of teams. In any event, there
are several interesting observations from this table:
First, for the limited data that I have, Phil Steele's
publication consistently appears to give the smallest error between the preseason
rankings and my simulated post-season rankings. I
have his data as the best in 4 of the 5 years where I have rankings from
multiple sources. He always advertises that his rankings are the most accurate,
and I cannot dispute that with this analysis.
Second, that said, there is not a huge difference between the different
publications. So, there is no strong
reason to rush out and buy any one of these publications over the others based
on the rankings alone (I will comment a little more on this later). Third, none of
the publications seems to get that close to the final rankings. The average deviations are all in the range
of 15-20 slots, which is an average error of ~15%. That does not seem great to me.
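For anyone wondering what I mean by "average absolute difference," the metric is nothing fancy. Here is a minimal sketch in Python of the calculation; the team names and rank numbers below are placeholders, not my actual data:

```python
def average_rank_error(preseason, postseason):
    """Mean absolute difference between two rank dictionaries keyed by team name."""
    common = set(preseason) & set(postseason)
    return sum(abs(preseason[t] - postseason[t]) for t in common) / len(common)

# Hypothetical example with three teams (placeholder names and ranks).
preseason = {"Team A": 1, "Team B": 40, "Team C": 100}
postseason = {"Team A": 12, "Team B": 25, "Team C": 128}
print(average_rank_error(preseason, postseason))  # (11 + 15 + 28) / 3 = 18.0
```

On a 128-team list, an average error of 15-20 slots is where the ~15% figure above comes from.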
I wanted to dive a little deeper into the third point. As the table indicates, I have the most
historical data on Phil Steele’s rankings, so I decided to go back ten years
and compare all of his preseason rankings to all of my post-season
rankings. There are several ways to look
at this data, but I find the most informative to be a histogram of the
deviations, a scatter plot, and a plot of the average post-season ranking as a
function of the initial Phil Steele ranking (basically the scatter plot data
where the y-axis contains the average and standard deviation (error
bars) for each rank instead of each individual data point). Once again, there
are several conclusions we can draw from this data.
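If you want to build these views yourself, all three start from the same list of (preseason rank, post-season rank) pairs. Here is a rough Python sketch of the histogram and scatter-plot numbers; the pairs below are made up for illustration, not the real ten years of data:

```python
import numpy as np

# Each row is (preseason rank, my algorithm's post-season rank); placeholder values.
pairs = np.array([(4, 12), (20, 85), (100, 22), (55, 48), (70, 90)])
pre, post = pairs[:, 0], pairs[:, 1]
dev = post - pre

# Histogram-style summary: share of picks within +/- 5 and +/- 10 slots.
within_5 = np.mean(np.abs(dev) <= 5)
within_10 = np.mean(np.abs(dev) <= 10)

# R-squared for the scatter plot of post-season rank vs. preseason rank.
r = np.corrcoef(pre, post)[0, 1]
print(within_5, within_10, r ** 2)
```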
First, the histogram gives us an idea of the distribution of
the variance. It is fairly bell shaped, with 24% of the picks falling within
+/- 5 slots of the final ranking, and 41% falling within +/- 10 slots. But, the tails of the distribution are also
fairly long. 23% of all of Steele’s
picks are not within 30 slots of the final ranking. The scatter plot tells a very similar story,
and in this case we can see that the correlation (R-squared = 0.66) is OK, but
not that great. The scatter plot also tends to highlight the real misses, like
when Steele ranks a team in his top 20 (like Illinois in 2009) but that
team winds up 3-9 with a ranking in the 80s by my algorithm, or when teams like
Utah St. and San Jose St. in 2012 are ranked around 100 by Steele but wind up
ranked in the top 25 by my algorithm and the national polls. The plot of the average ranking vs. initial ranking
data shows the Phil Steele data in perhaps the best light. This plot shows that for any given preseason ranking, on
average, Steele is pretty close, but the spread around that average is still quite large. Notably, the deviation is much smaller for
teams in Phil Steele’s ~Top 5.
Historically, those teams do usually wind up having great seasons, but
there are exceptions (like the 2007 Louisville team, which started ranked #4,
but who ended 6-6). That said, the same
trend is also found at the bottom end of the chart, so it might have more to do
with the fact that teams ranked high (or low) only really have one direction to
go: down (or up). That fact is best illustrated
by a plot of the standard deviation of the post-season ranking versus the preseason
ranking (basically, the plot of the error bars as a function of the preseason
ranking), which is shown here with a clear parabolic trend.
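For the curious, the error-bar plot (and the standard deviation plot) comes from grouping the post-season finishes by preseason rank. A minimal sketch, again with made-up pairs:

```python
from collections import defaultdict
from statistics import mean, stdev

# (preseason rank, post-season rank) pairs; placeholder values only.
pairs = [(1, 3), (1, 10), (2, 2), (2, 30), (3, 5), (3, 6)]

by_rank = defaultdict(list)
for pre, post in pairs:
    by_rank[pre].append(post)

# Average post-season finish and its standard deviation for each preseason
# rank; plotting the standard deviation against the preseason rank is what
# produces the parabolic trend mentioned above.
for rank in sorted(by_rank):
    finishes = by_rank[rank]
    spread = stdev(finishes) if len(finishes) > 1 else 0.0
    print(rank, mean(finishes), spread)
```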
What is perhaps the most interesting aspect of all of this to
me harkens back to my second observation shown in the 1st table
above: the fact that the deviations from the different publications are all
basically the same for a given year. To
visualize this, I took the predictions from the two publications for which I
have the most data tabulated (Phil Steele and Athlon's) and plotted them against each other in a
scatter plot, which is shown here:
Not surprisingly, the correlation between the two
predictions is rather high (R-squared = 0.91) and much higher than the
correlation to reality, so to speak. So,
as my first conclusion, I think that we can say that preseason predictions are
OK, but not great (they are certainly not destiny), and they agree with each
other far more than they agree with the actual results on the field.
This analysis led me to think about another interesting
topic which is related to the first. Now
that we have looked at the robustness of preseason rankings, what about
in-season predictions? More
specifically, what about metrics such as ESPN's vaunted FPI? In the 2016 season, I decided to put the FPI
to the test alongside my own algorithm to see how they performed. As it turns
out, this is a tricky question because defining “performance” in this context
is not as easy as you might think. A big
part of the reason why is that there is generally a very poor correlation
between any predicted margin of victory and the actual result. The best predictor, I suppose not
surprisingly, is the Vegas spread, and a scatter plot of the actual
game margins vs. the opening Vegas spreads for the entire 2016 season is shown
here. As you can see, the R-squared is a
pathetic 0.214. But, this is better than
the FPI, which only mustered an R-squared of 0.196, and, sadly, my algorithm,
which only managed an R-squared of 0.167.
I won’t bother to show you those plots, as they both look like shotgun
blasts.
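For reference, the R-squared numbers above come from comparing each set of predicted margins to the actual margins in the same way. A quick sketch of that comparison, with placeholder margins standing in for the real 2016 game data:

```python
import numpy as np

# Placeholder data: predicted and actual margins (favorite's point of view)
# for a handful of games, standing in for the full 2016 season.
actual = np.array([7, -3, 21, 10, -14])
predicted = {
    "Vegas spread": np.array([3, -7, 14, 6, -3]),
    "FPI": np.array([5, -2, 10, 1, -8]),
    "my algorithm": np.array([1, -10, 24, -4, -1]),
}

for name, margins in predicted.items():
    r = np.corrcoef(margins, actual)[0, 1]
    print(f"{name}: R-squared = {r ** 2:.3f}")
```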
Last year, as I pored through the FPI data, I noticed
something odd: it was quite rare for the FPI to predict a Vegas upset. I only counted 37 predicted upsets total out
of over 750 games (5%), which is interesting because historically about 25% of
all college games wind up being upsets per Vegas. 2016 saw over 200 upsets total. My algorithm picked over 80 upsets for the
season. Granted, it was only right about
the upset 37% of the time (which is below my algorithm's historical average of
40%), but the FPI only got 46% of its upset picks correct. When I plotted the full-year projected
margins from the FPI versus the Vegas spread (see below), you see that the
correlation is quite good (R-squared = 0.86).
By comparison, my algorithm did not do quite as well, but it was still fairly
highly correlated (R-squared = 0.72).
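To be clear about the bookkeeping: I count a predicted upset whenever a model's projected margin goes against the Vegas favorite, and the pick is correct when that favorite actually loses. A rough sketch with hypothetical game records:

```python
# Each record is (Vegas margin, model margin, actual margin), all from the
# Vegas favorite's point of view, so a negative actual margin is an upset.
# These numbers are made up for illustration.
games = [(7.0, -2.5, -3), (14.0, 3.0, 10), (3.5, -1.0, 4), (21.0, -6.0, -7)]

upset_picks = [(model, actual) for vegas, model, actual in games if model < 0]
correct = sum(1 for model, actual in upset_picks if actual < 0)

print(len(upset_picks), "upset picks,", correct, "correct")
# With the placeholder data above: 3 upset picks, 2 correct.
```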
From all of this, I come to my second main
conclusion: In-season algorithms
don’t do a good job of predicting the outcomes of actual games, but they can do
a good job of predicting the Vegas spread.
In this regard, the FPI (and to a lesser extent, my algorithm) does have
value in doing things such as projecting point spreads out 2-3 weeks in
advance. That type of analysis appears
to be fairly robust. I also must concede
that the FPI does a better job of predicting these spreads than my algorithm
does (which I would expect considering they most likely have more than one dude
working on it in his spare time). But,
you could argue that the FPI is so good at predicting the spread that it
doesn’t add much to the discussion. It
is on some level too conservative. At
least my algorithm takes some chances and will make more than 1-2 upset picks a
week. But, at the end of the day, the
gold standard is the Vegas spread, which honestly makes sense. After all, if there were a computer program
out there that could beat Vegas, somebody would be very rich and they would
certainly not tell the rest of us about it.
So, with this knowledge, perhaps the most useful figure that
I can leave you with is the following:
the 5-point boxcar-averaged plot of the probability of the favored team
winning as a function of the opening Vegas spread for all college games back to
2009. As you can see, if the data is
smoothed, it forms a nice quadratic curve from a 50-50 toss-up to a virtual
sure thing once the spread reaches around 30.
(In reality, there have been a total of 2 upsets in games where the
spread exceeded 30 since 2009, a frequency of less than 1%.) The fit is not perfect, but the equation on
the chart is very simple and easy to remember.
I would imagine the line should asymptotically approach 100%, but never
actually reach it, because in college football, I believe the underdog always
has a chance.
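If you want to reproduce that last figure, the smoothing is just a centered 5-point moving average of the favorite's win rate at each spread, with a quadratic fit laid on top. A minimal sketch, with placeholder win rates standing in for the real tallies:

```python
import numpy as np

# win_rate[i] = fraction of Vegas favorites that won when the opening spread
# was i points; the values here are placeholders for the real tallies.
spreads = np.arange(0, 31)
win_rate = np.linspace(0.5, 0.99, len(spreads))

# Centered 5-point boxcar (moving) average; 'valid' drops two points per end.
kernel = np.ones(5) / 5
smoothed = np.convolve(win_rate, kernel, mode="valid")

# A quadratic fit like the one on the chart then comes from a 2nd-degree polyfit.
coeffs = np.polyfit(spreads[2:-2], smoothed, deg=2)
print(coeffs)
```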