
Stats-Based Bracket Prediction: A Retrospective

A few days after Selection Sunday this year, I posted a very detailed analysis of the 2019 NCAA Men's basketball bracket, including several predictions.  I made my picks this year based on an analysis of Kenpom efficiency data, as well as on historical upset trends. I entered a few different online pools using slightly different versions of my bracket, and I wound up winning each pool going away.

Among other things, I correctly predicted Auburn to win the Midwest Region, and I correctly predicted that Virginia was going to beat Texas Tech in the Final.  I had never before done such a thorough analysis of the tournament, and the next obvious question was: did I just get lucky this year, or is this a method that can be used in the future, with some expectation of success? In the weeks since the tournament ended, I have done an even deeper dive into the data, and I think I have the answer, which is "Yes... and No."

First off, I should say a little about the methodology that I used this year. When you think about any given NCAA tournament, you can look at the historical data to get a feel for what is likely to happen.  For example, it is easy to look at stats on the 5-12 match-ups and conclude that an upset is going to occur in roughly one-third of all contests.  It is also easy to look up the numbers that tell you that at least one 1-seed is going to make the Final Four, a second 1-seed will join them about half the time, and getting 3 or more 1-seeds to the final weekend has only happened 5 times in history.

But, the trick is to pick the correct 5-12 upset(s) and the correct 1-seed to win each Region (among other things).  That, of course, is the hard part.  But, advanced stats, such as Kenpom efficiency data, might be able to provide some hints. After all, not all 5-12 match-ups are the same. Sometimes, there is a strong 5-seed matched up against a weak 12-seed, or vice versa.  So, my theory was that making this comparison would be helpful in making my picks.

Along the way, I came up with a visualization tool that I found quite helpful.  First, I used the historical Kenpom data to calculate both the average and the standard deviation of the Kenpom adjusted efficiency margin for each seed, 1 to 16.  I then made a plot of each Region, showing each team's efficiency margin relative to the historical average for its seed.  As an example, the chart for this year's East region is shown below.


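For the curious, a rough Python sketch of how such a chart can be assembled is shown below.  The `kenpom` data frame, its column names, and the `east_teams` list are placeholders for however the historical CSV data gets loaded; they are not Kenpom's actual file format.

```python
# A minimal sketch, assuming a DataFrame `kenpom` with one row per
# team-year and columns: Year, Team, Seed, AdjEM (adjusted efficiency
# margin). The column names and the `east_teams` list are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# Historical average (and spread) of AdjEM for each seed line, 1 to 16
seed_stats = kenpom.groupby("Seed")["AdjEM"].agg(["mean", "std"])

# The 16 teams in the region being plotted, for the year in question
region = kenpom[(kenpom["Year"] == 2019) & (kenpom["Team"].isin(east_teams))]

# Each team's efficiency margin relative to the average for its seed
rel_margin = region.set_index("Team").apply(
    lambda row: row["AdjEM"] - seed_stats.loc[row["Seed"], "mean"], axis=1
)

rel_margin.sort_values().plot.barh()
plt.xlabel("AdjEM relative to historical seed average")
plt.title("East Region, 2019")
plt.tight_layout()
plt.show()
```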
Just looking at this chart gives quite a bit of insight into this region.  For example, while Duke was very much an above average 1-seed, MSU was an extreme outlier as a "good" 2-seed.  In fact, MSU was the 2nd best 2-seed in the Kenpom era (only behind the 2015 Arizona team).  So, it stood to reason that MSU and Duke would meet in the Regional Final and that the game would be close.  

Looking farther down the diagram, the Kenpom data suggested that LSU was not going to be much of a threat to MSU (which they weren't), but that Virginia Tech might give Duke some problems (which they did).  Moreover, the fact that LSU was a weak 3-seed facing a strong 14-seed in Yale in the first round led many people to take Yale over LSU as the big 1st round upset.  That almost happened as well.

But, there were also a few surprises.  Louisville looked like a clear favorite over Minnesota.  There was absolutely no indication that Liberty would give Mississippi State trouble, let alone beat them, and Bradley certainly didn't look like a scary 15-seed.  We all recall how those scenarios played out.

At the end of the day, however, the important thing is robustness. Is this type of analysis going to work as well in the future?  Fortunately, 2019 is not the only year where Kenpom data is available.  Ken Pomeroy has CSV files on his website with data going back to 2002.  So, it is possible to go back through the past 18 tournaments, apply similar logic, and see how things turn out.  I have spent a fair amount of time since the final buzzer in Minneapolis doing just that.

On a first pass, I used the same data set as shown in the chart above to calculate the odds that the favorite wins (or, conversely, that an upset occurs) for an "average" game between any given combination of seeds.  For example, I took the average Kenpom efficiency of a 5-seed and the average efficiency of a 12-seed, computed the average point spread, and from that, the odds.  As it turns out, this calculation correlates quite well (with a couple of exceptions) with the observed results in actual tournament games, as shown here:


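As a sketch of the underlying math: the standard trick (and an assumption on my part about the exact details used here) is to convert the efficiency gap into a projected point spread and then treat the final margin as roughly normally distributed around that spread, with a standard deviation of about 11 points.

```python
# A sketch of converting an efficiency gap into upset odds. The tempo
# scaling and the 11-point standard deviation are common rules of thumb,
# not values taken from the analysis above.
from scipy.stats import norm

def projected_spread(adj_em_fav, adj_em_dog, tempo=68.0):
    """Projected margin for the favorite: the per-100-possession
    efficiency gap scaled to a typical game's number of possessions."""
    return (adj_em_fav - adj_em_dog) * tempo / 100.0

def upset_odds(adj_em_fav, adj_em_dog, sigma=11.0):
    """Probability that the underdog wins, assuming the final margin is
    normally distributed around the projected spread."""
    return norm.cdf(-projected_spread(adj_em_fav, adj_em_dog) / sigma)

# Illustrative inputs only: a hypothetical average 5-seed (+17) against
# a hypothetical average 12-seed (+7) comes out to roughly a 27% upset.
print(f"{upset_odds(17.0, 7.0):.0%}")
```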
I then went through all tournament games since 2002 and calculated the predicted odds of each game being a "seed upset" using the pre-tournament Kenpom efficiencies.  I scaled this data relative to the average expected odds for that particular seed pairing.  Finally, I compared this data to the odds of a seed upset actually happening in the real tournament.  That correlation is shown here:


This is very good news for my theory, but it should not have been a surprise.  I have previously made the observation that the NCAA tournament follows the same probability behavior as the regular season when it comes to the odds of winning or losing as measured by the Vegas spread.  It only stands to reason that contests where the Kenpom projected spread is tighter than expected would be more likely to end up as upsets.  That is essentially what the above chart is telling us.

But, this data by itself is a bit hard to use, because it lumps together all tournament games, whether it is a 1-16 game or an 8-9 game.  So, in order to make this data more useful, I came up with a strategy to define a certain threshold of probability for each possible match-up to decide whether an upset is likely or not.

Without going too far down the math rabbit hole, the basic idea is as follows.  To once again use the 5-12 match-up as an example, as I stated above, this match-up has historically resulted in an upset about one-third of the time. If I use the average Kenpom efficiencies for 5-seeds and 12-seeds, I get the same average probability of an upset within a percentage point.  So, if I want to predict about the right number of 5-12 upsets and maximize my odds of picking them correctly, I would target the third of those games with the narrowest spread.

As a more general tool, I used the Kenpom data to generate a distribution of mathematically consistent odds for all possible combinations of seeds, and then set the "odds threshold" in the same manner as I described above for the 5-12.  As I define it, if this "tuning parameter" is equal to 1, then I would be predicting an upset at a rate that is consistent with what one would expect in reality (for example, one-third of all 5-12 match-ups).  A threshold of 0 would be equivalent to only picking an upset in the cases where the Kenpom odds were actually greater than 50%.  In other words, in the cases where Kenpom efficiencies suggested that the lower seed was actually the stronger team.
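To make the tuning parameter concrete, here is one way the threshold could be implemented.  The linear blend below is my own guess at a mechanism that matches the two endpoints described above, not a transcription of the actual algorithm.

```python
# A sketch of a per-pairing odds threshold controlled by the tuning
# parameter t: t=0 reduces to "pick only when the underdog is the Kenpom
# favorite," and t=1 flags upsets at the pairing's historical rate.
import numpy as np

def odds_threshold(upset_odds_history, historical_rate, t):
    """upset_odds_history: Kenpom-derived upset odds for all past games
    of this seed pairing (e.g., every 5-12 game since 2002).
    historical_rate: long-run upset rate for the pairing (~1/3 for 5-12).
    """
    # Cutoff that flags exactly the historical share of games as upsets
    rate_cutoff = np.quantile(upset_odds_history, 1.0 - historical_rate)
    # Blend between the 50% rule (t=0) and the historical-rate rule (t=1)
    return (1.0 - t) * 0.5 + t * rate_cutoff

def predict_upset(game_odds, threshold):
    return game_odds >= threshold
```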

The beauty of this method is that it allowed me to set up a simple algorithm to automatically make picks through all rounds in any given NCAA tournament in history in the Kenpom era (2002-2019).  Furthermore, I could tighten or loosen the odds threshold using the tuning parameter to see the effect on the predictive power of this method. In order to test drive this methodology, I first looked back at the actual results from the 2002-2019 tournaments.
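The picking loop itself can be quite simple.  The sketch below walks one 16-team region round by round; `upset_odds` is the spread-based function sketched earlier, and `threshold_for` is a hypothetical lookup of the per-pairing odds threshold.

```python
# A sketch of automatic picks for one region. Teams are (seed, adj_em)
# pairs listed in bracket order: 1,16,8,9,5,12,4,13,6,11,3,14,7,10,2,15.
def pick_region(teams, upset_odds, threshold_for):
    field = list(teams)
    while len(field) > 1:
        next_round = []
        for a, b in zip(field[::2], field[1::2]):
            fav, dog = (a, b) if a[0] < b[0] else (b, a)  # lower seed is the favorite
            odds = upset_odds(fav[1], dog[1])
            pick = dog if odds >= threshold_for(fav[0], dog[0]) else fav
            next_round.append(pick)
        field = next_round
    return field[0]  # the predicted regional champion
```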

I came up with the following way to try to visualize what is going on. For a given value of the tuning parameter (usually 0 to around 2), the algorithm will automatically classify certain games as likely upsets. This prediction is either correct or incorrect.  Furthermore, there is a complementary set of games where the algorithm does NOT predict an upset. But, sometimes upsets occur in these games as well. I used the observed upset rate in both sets of games (predicted and not predicted) as my two axes (y and x, respectively).  I also decided to cluster the data according to the seed pairings, and I only selected the 20 most frequently observed pairings.

If I organize the data in this way, there are three distinct zones.  First of all, I would hope that in most cases the set of data where my algorithm predicted an upset shows a higher percentage of upsets than the non-prediction data set. In the cases where it doesn't, upsets are essentially "unpredictable" using this method.  This difference can be delineated by a diagonal line where x=y.  But, not all "predictable" predictions are equal.  If you think about it, in order to really gain an advantage in something like an office pool, the predictions need to be correct over half of the time. For example, if you make 8 upset picks, but only 3 are correct, a "chalk" bracket would have two more correct picks.  This difference between a truly useful prediction and a "sort-of useful" prediction can be visualized using a horizontal line at 50%.
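In chart form, those three zones are just a scatter plot with two reference lines, something like this sketch:

```python
# A sketch of the "predictability chart" layout described above: each
# seed pairing is a point, the x = y diagonal separates "unpredictable"
# pairings, and the 50% line separates "truly useful" from "sort-of
# useful" predictions.
import matplotlib.pyplot as plt

def predictability_chart(points):
    """points: dict mapping a seed-pair label to an (x, y) pair, where
    x = upset rate in non-predicted games, y = rate in predicted games."""
    fig, ax = plt.subplots()
    for label, (x, y) in points.items():
        ax.scatter(x, y)
        ax.annotate(label, (x, y))
    ax.plot([0, 1], [0, 1], ls="--")  # below the diagonal: unpredictable
    ax.axhline(0.5, ls=":")           # above 50%: truly useful picks
    ax.set_xlabel("Upset rate when no upset was predicted")
    ax.set_ylabel("Upset rate when an upset was predicted")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    return ax
```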

The chart below shows the odds for different seed pairings using a tuning parameter of 0 (where upset picks are only made when the projected Kenpom odds dip below 50% for the seed favorite).  Each data point is labeled with the seed pair, followed by the ratio of correct upset picks to total upset picks for that pair, back to 2002.


There is a lot to unpack here. One fact of note is that of all tournament games back to 2002 (about 950 games), Kenpom actually had the lower seed as the better team in 101 of them.  In this subset of games, the seed "underdog" won 56 of those games (55.4%).  So, it should not be too much of a surprise that 7 of the seed pairings fall into the "predictable" area of the plot above. Notably, using a tuning parameter of 0 was very effective in predicting upsets for the 1-4, 1-2, 3-11, and 4-5 seed match-ups, resulting in a record of 15 for 19 (79%).  This strategy also did very well in the 6-11, 7-10, and 8-9 match-ups, although the rate of correct predictions from this group was much closer to 50% (34/61, or 56%).  But, those are numbers that can seriously help in an office pool.

For the 3-6 match-up, this method did not do as well (2/6), but this is still slightly better than the upset rate for the rest of the 3-6 match-ups.  On the other side of the diagonal line are the 2-3, 2-7, and 5-12 match-ups, where using this method was essentially worse than just picking upsets randomly.  For the remaining match-ups, this method did not predict any upsets.

Here is as good of a place as any to mention one conclusion that I drew from my study of this data.  If you are looking for a mathematically sound way to try to "beat the odds" when it comes to making NCAA tournament picks, the method above (picking all Kenpom favorites) is essentially the best strategy.  This is the only systematic way to likely get more than 50% of your picks correct.  This might seem obvious, but I actually did a whole bunch of simulations to convince myself that it is true. It does make sense, as in order to be right more than 50% of the time, you have to bet on things with better than a 50% chance of happening.
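A toy version of that sanity check is easy to run.  The simulation below (my own reconstruction, not the original one) shows that a batch of upset picks only gains ground on chalk when each pick has better than a 50% chance of hitting:

```python
# Each correct upset pick gains one pick on a chalk bracket; each miss
# hands chalk a free correct pick. The expected gain is n * (2p - 1),
# which is negative whenever p < 0.5.
import random

def expected_gain_over_chalk(p_upset, n_picks=8, trials=100_000):
    gain = 0
    for _ in range(trials):
        hits = sum(random.random() < p_upset for _ in range(n_picks))
        gain += hits - (n_picks - hits)
    return gain / trials

for p in (0.35, 0.50, 0.55):
    print(f"p = {p:.2f}: {expected_gain_over_chalk(p):+.2f} picks vs. chalk")
```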

As I alluded to above, this strategy would recommend picking only a handful (5-6) of seed upsets in any given tournament, and only a little over half of those picks would pan out.  The historical data also suggests that this strategy would correctly predict only about 20% of the upsets that actually occur.  In other words, it is a very boring and conservative strategy.

But, the simulations that I ran did suggest that for seed pairings where the average odds of an upset were not too low (over 25% or so), there would be enough outliers with better odds (closer to, but not over, 50%) that there was still a reasonable probability of getting more picks correct than incorrect.  So, I went back to the historical data from 2002 to 2019 and looked at the effect of raising the tuning parameter (in effect, making increasingly bold upset picks in a mathematically consistent manner).  In this analysis, I was mostly concerned with two metrics: first, the percentage of correct upset picks (with the goal of staying over 50%), and second, the share of correct upset picks relative to all the upsets that actually happened (with the goal of getting this number as high as possible).  The results of this analysis are shown below:



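Mechanically, the sweep behind the chart above amounts to re-running the pick algorithm at each tuning value and recording the two metrics, along these lines (`run_all_tournaments` is a stand-in for the bracket-walking logic sketched earlier):

```python
# A sketch of the tuning-parameter sweep: "precision" is the fraction of
# upset picks that were correct, and "recall" is the share of all actual
# upsets that were picked.
import numpy as np

def sweep(run_all_tournaments, t_values=np.arange(0.0, 2.01, 0.1)):
    precision, recall = [], []
    for t in t_values:
        # One (predicted_upset, was_upset) flag pair per game, 2002-2019
        games = run_all_tournaments(t)
        picks = [g for g in games if g[0]]
        hits = sum(1 for g in picks if g[1])
        actual = sum(1 for g in games if g[1])
        precision.append(hits / len(picks) if picks else float("nan"))
        recall.append(hits / actual)
    return t_values, precision, recall
```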
As one might expect, the higher the tuning parameter, the larger the share of correct upset picks (the red line).  Interestingly, the percentage of correct upset picks hovers right around 50% up until a value of about 0.6.  After that, it drops pretty linearly.  So, it seemed to me that setting the tuning parameter to a value of about 0.66 would be a nice balance of "correctness" and "boldness."  I must admit that I have no idea why ~0.66 seems to work, but it might relate to the fact that roughly two-thirds (~68%) of a normally distributed data set lies within one standard deviation of the mean.  Sure. Why not?

In any event, I adjusted the tuning parameter to 0.66 and re-plotted the historical data on the "predictability chart."  The result is shown below.



Interestingly, this strategy does seem to be an improvement of sorts over the more conservative strategy discussed above.  A full half of the seed pairings fall into the "predictable" zone, and 6 of them (1-8, 1-9, 1-3, 2-10, 4-13, and 4-5) give correct upset predictions over 70% of the time.  4 additional seed pairings fall into the "sort-of predictable" zone, including the notorious 5-12 match-up and likely 1-seed Sweet 16 contests.  The remaining 6 seed pairings fall into the "unpredictable" zone.  These include the 8-9, 2-3, and 3-6 contests, as well as the first-round games for 1-, 2-, and 3-seeds, which all went "o-fer" using this strategy.

But, based on the simulations that I ran, this is perhaps not a surprise, since the odds of a 14, 15, or 16-seed winning in the first round are all below 20% on average.  This is below the odds where the simulations suggested that my systematic approach would work.  If I use the 3-LSU / 14-Yale game from this year's tournament as an example, it was certainly the most likely potential "big" upset.  However, what this meant in reality was that the odds of an upset were up to 20% instead of just being 15%.  Either way, it was a long shot.

In contrast, the average odds for a 4-13 upset are around 22%, which is right on the borderline of the point where my method starts to make sense, and over 10 of those contests since 2002 have had odds in the 30%-40% range.  This is a range where you would expect to see a handful of upsets, and that is exactly what is observed.

In order to complete the picture, I also took a look at what happens to the predictability chart if I crank the tuning parameter up to 1.0. As I explained above, this is the value where the total number of predicted upsets should approximate the number actually observed in tournament play.  Sure enough, at this value the algorithm predicts a total of 271 upsets from 2002 to 2019, while the actual observed number is 269.  The predictability chart is shown below.


As expected, the accuracy of this strategy is notably lower than the strategy using a tuning parameter of 0.66.  There are now only 7 seed pairings solidly in the "Predictable" range, as the 1-8, 1-9, and 4-13 pairings fell into the "Sort-of predictable" zone and the 3-11 pairing is now right at 50%.  The same unpredictable pairings are still unpredictable.  On a positive note, the 5-12 pairing is now barely in the "Predictable" zone, and the algorithm made its first (and only) correct "big" upset prediction: picking 15-Lehigh to upset 2-Duke in 2012. But, it had to make 16 bad "big" upset picks to finally get one correct.

So, with this analysis in hand, I went back once again to the historical brackets from 2002 to 2019 to see how my algorithm would have performed if I applied it on Selection Sunday.  I used a tuning parameter of 0.66 as a baseline, and I did make one other correction: I eliminated the possibility of a 1-, 2-, or 3-seed losing in the first round.  Based on the data above, I now firmly believe that these upsets are essentially unpredictable.

I thought about also scaling back the tuning parameter on some of the specific seed match-ups that fall into the "sort-of predictable" and "unpredictable" zones.  I certainly would get better results that way.  However, that feels a bit like betting on dice that you know are loaded.  Furthermore, while I have mathematical reasons for not picking a 15-2 upset any more, I cannot come up with any reason why picking a 2-3 upset is so difficult. I tend to believe that it is just normal statistical variation, and I see no reason to expect it to continue over the next 18 tournaments.  Overall mathematical consistency is more important, so I decided to stick with a uniform tuning parameter just to see what happens.  As an aside, I did not bother to try to pick the eventual champion or any of the winners in the National Semifinal round. I decided to focus exclusively on the result of each Regional.

There are several ways to present the results of this experiment. First, I tabulated the number of correct picks that my algorithm made in the Sweet 16, Elite 8, and Final Four rounds and compared that to the number of correct picks that would have been made if I had just taken the highest seed in all match-ups, AKA a "chalk" bracket.  The comparison is shown here, where the number after the round label corresponds to the "chalk" team in that part of the bracket (e.g., "S16-1" is the part of the region with the 1, 8, 9, and 16 seeds):


As you can see, my algorithm's picks closely mirror the number of correct picks using the "chalk" strategy, but the chalk bracket does slightly better. This is true for all positive values of my tuning parameter.  I can barely beat chalk if I set the tuning parameter to the super-conservative value of -0.15, but what fun is that?

As for Final Four picks, the distributions of seeds in actual Final Fours, in my algorithm's Final Four picks, and in my correct Final Four picks are shown below.


The seed distribution for my algorithm's picks matches reality pretty well, but only 26 of the 72 total picks (36%) are correct, which is a hair below the number of 1-seeds that advanced (27) over that time period.

On some level, this is a bit disappointing, as it implies that just picking straight chalk is better than using my newly developed algorithm.  That is technically true.  But, I would counter with the argument that although this algorithm is not fool-proof, it is a valuable tool to use to help make picks when it is combined with other factors, such as the eye-ball test, Vegas spread data, historical trends, injury information, etc.  If you combine this with a little good fortune, it can be a winning combination.

In this year's tournament, when I made my picks, this is basically what I did. I did not know it at the time, but my algorithm would have predicted Virginia and MSU to the Final Four.  Auburn did look like a potential spoiler and Texas Tech looked a bit dangerous, but just using the algorithm would have prompted me to take Gonzaga and UNC.  But, I knew three 1-seeds almost never make the Final Four, and I thought teams like Michigan, UNC, and Kentucky were all a bit over-rated.  Furthermore, sometimes Kenpom does not tell the whole story.  Kenpom really liked Wisconsin, but the opening Vegas line for the 5-Wisconsin / 12-Oregon game was much tighter than I expected.  So, in some versions of my bracket, I (correctly) picked Oregon to make the Sweet 16.

In my analysis of this data, I came across one other observation that bears mentioning.  When looking at a bracket to try to figure out who is going to advance to the Final Four, check to make sure that the 1-seed is actually the highest-ranked team in the Region according to Kenpom.  If they are, then they have a shade under a 45% chance to make the Final Four.  However, if the 1-seed is NOT the highest-ranked team, their odds to win the Region plummet to 1 in 6 (16%).  Interestingly, the same basic math applies to 2-seeds.  If the 2-seed is the best team in the Region (according to Kenpom), that team makes the Final Four 43% of the time.  In other cases, the 2-seed's odds are also right around 16%.  I can't really provide a mathematical reason for this, but it certainly is notable.
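Checking this for yourself is a quick split of the historical data.  The sketch below assumes a hypothetical `regions` data frame with one row per region-year; the column names are placeholders, not an actual data source.

```python
# A sketch of the conditional Final Four rates described above, assuming
# a DataFrame `regions` with hypothetical columns: ff_seed (the seed
# line that won the region) and best_kenpom_seed (the seed line of the
# region's top Kenpom team).
import pandas as pd

def final_four_rate(regions: pd.DataFrame, seed: int) -> pd.Series:
    made_it = regions["ff_seed"] == seed            # did this seed win the region?
    was_best = regions["best_kenpom_seed"] == seed  # was it the Kenpom top team?
    return made_it.groupby(was_best).mean()

# Hypothetical usage: final_four_rate(regions, seed=1) should come out
# near 0.45 when was_best is True and near 0.16 when it is False, per
# the historical split quoted above.
```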

In the space I have remaining, I just wanted to highlight some of the good predictions that the algorithm would have made through history.  In most years, the algorithm only gets 1 of the Final Four teams correct.  However, there are a couple of notable exceptions.  In 2012, it got 3 out of 4, and ironically, the team it got wrong was MSU (who was upset by Final Four-bound Louisville in the Sweet 16).  The algorithm actually went 4 for 4 in 2005 by correctly picking 5-seed MSU's upset over Duke and Louisville's run to the Final Four as a 4-seed.  The algorithm also went 4 for 4 in 2008, the one and only time all four 1-seeds made it to the final weekend.  While this prediction may seem obvious, this is the only time over the 18-year span when the algorithm picked all four 1-seeds.

As for MSU, it is also interesting to look back at the last 18 years to see which of MSU's deep runs (or early exits) were "predictable" and which ones weren't.  I have already noted that MSU's Final Four run in 2005 was surprisingly predictable, while the fairly early exit in 2012 was not (although we must remember that Branden Dawson was injured in the Big Ten Tournament that year, and that definitely impacted that team).  MSU failed to make it out of the 1st round in 2002, 2004, and 2006, and my algorithm correctly predicted each of those losses at the hands of NC State, Nevada, and George Mason.

In 2003, MSU made it to the Regional Final (only to lose to Texas).  In this case, the algorithm did not predict a win over 2-seed Florida that year, although MSU was an above average 7-seed and Florida was a below average 2-seed.  In 2007 and 2008, MSU won exactly the number of games that the algorithm predicted.  Then, things get interesting.

In 2009 and 2010, MSU made the Final Four.  But, the algorithm did not have MSU going anywhere near that far.  In 2009, the algorithm had MSU just barely beating USC (which is basically what happened) and then getting sent home by West Virginia (not Kansas) in the Sweet 16.  Similarly, the algorithm had MSU bowing out to Maryland in the 2nd round in 2010 (which also almost happened).  I was not surprised to learn that MSU over-achieved in 2010, but the 2009 data is a bit surprising.  I can only posit that MSU was a bit of a late bloomer that year and their early season performance pulled down their Kenpom numbers relative to the real situation in March.

In 2011, the algorithm felt that MSU really should have beaten UCLA, but that was it.  In 2013, the Sweet 16 loss to Duke was expected.  In 2014, the algorithm (just like everyone else) correctly picked MSU to upset 1-seed Virginia and picked MSU for the Final Four.  But, UCONN blew up that prediction.  In 2015, MSU was a very strong 7-seed, but 2-seed Virginia and 1-seed Villanova were both very strong for their seeds as well.  So, that run to the Final Four was certainly unexpected.

And then, there was 2016...  MSU was actually picked to make the Final Four, but instead, Middle Tennessee happened.  Although the Blue Raiders were an above-average 15-seed, this upset was completely unpredictable.  In 2017, MSU was not expected to beat Miami in their 8-9 game, while in 2018, MSU should have at least made it into the Sweet 16.  Finally, there was 2019, when my algorithm once again correctly picked MSU to make the final weekend.

For those scoring at home, my algorithm correctly picked MSU's number of tournament wins 8 times since 2002.  MSU overachieved the algorithm's picks 5 times and underachieved an additional 5 times.  All in all, that seems pretty good.

That does it for now, and with that, I think it is time to take a break from basketball stats.  Enjoy the summer, and be on the lookout for some football number crunching in a few months.

Go Green.
