For those of you who, like me, love your data almost as much as college basketball, you have probably noticed the 2022 Schedule link is already available on the site. This is our traditional Excel spreadsheet with the standings and rankings in one worksheet and the full NCAA schedule with scores in another worksheet. I have continued with the approach that conference tournament games are counted as conference games instead of post-season games. As we did last year, I have merged in the NET rankings from the NCAA website to have the updated rankings on the spreadsheet, along with a listing of records for the 4 quads. I then used that information to validate that my records are correct. It is a relatively simple check to match the overall records, home records, away records and neutral court records between what the schedule data says and what was in the NET rankings, so take the data for what it is…
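For the curious, the record check described above can be sketched in a few lines of Python. This is only an illustration and not the actual process behind the spreadsheet: the row layout, the team names, the scores and the "official" records are all made up, and the function names `tally_records` and `find_mismatches` are hypothetical.

```python
from collections import defaultdict

# Made-up schedule rows: (team, opponent, site, points_for, points_against).
# site is "H", "A" or "N" from the listed team's point of view, and every
# game appears twice, once per team, as it would in a team-by-team schedule.
schedule = [
    ("Team A", "Team B", "H", 70, 65),
    ("Team B", "Team A", "A", 65, 70),
    ("Team A", "Team C", "A", 60, 72),
    ("Team C", "Team A", "H", 72, 60),
    ("Team B", "Team C", "N", 68, 66),
    ("Team C", "Team B", "N", 66, 68),
]

# Made-up "official" records as (wins, losses). Team C is deliberately
# wrong so that the check has something to report.
official = {
    "Team A": {"overall": (1, 1), "H": (1, 0), "A": (0, 1), "N": (0, 0)},
    "Team B": {"overall": (1, 1), "H": (0, 0), "A": (0, 1), "N": (1, 0)},
    "Team C": {"overall": (2, 0), "H": (1, 0), "A": (0, 0), "N": (1, 0)},
}


def tally_records(rows):
    """Count wins and losses per team, overall and by game site."""
    records = defaultdict(lambda: {k: [0, 0] for k in ("overall", "H", "A", "N")})
    for team, _opponent, site, points_for, points_against in rows:
        idx = 0 if points_for > points_against else 1  # 0 = win, 1 = loss
        records[team]["overall"][idx] += 1
        records[team][site][idx] += 1
    return records


def find_mismatches(computed, official_records):
    """Return (team, split, computed, official) for every record that differs."""
    problems = []
    for team, splits in official_records.items():
        mine = computed.get(team)  # .get avoids creating empty entries
        for split, expected in splits.items():
            actual = tuple(mine[split]) if mine else None
            if actual != tuple(expected):
                problems.append((team, split, actual, tuple(expected)))
    return problems


mismatches = find_mismatches(tally_records(schedule), official)
for team, split, actual, expected in mismatches:
    print(f"{team} {split}: schedule says {actual}, official says {expected}")
```

An empty list of mismatches only shows that the win-loss totals agree. It does not prove that every individual score in the schedule is right.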
The NCAA Archive site used to have this great document called Team Pages – it had a page for each team with their computer rankings (including NET, KenPom and Sagarin scores) and then their full schedule listed by quad. This isn’t as good, but the 2022 Nitty Gritty link under research actually goes to the NCAA Statistics website with a summary of key statistics about each team. If you click on one of the team names, it takes you to that team’s schedule, and if you click on their NET Ranking on the schedule page, it will give you a team sheet that is similar to what they used to provide in the PDF. The only thing I don’t like about that is it doesn’t have all the other computer scores (the old PDF team sheets used to have KenPom and Sagarin scores, for example). It also isn’t in one large document, which is annoying. If I find a better option, I will add information as we go…
But that does have the benefit that there is now only one file to update. No promises that I will update the Schedule Excel file every day, but I wanted it to be available to everyone as early as possible. I will update it as I have time throughout the week – the spreadsheet has a page that says when it was last updated.
Obviously, remember the traditional Lunatic disclaimers. I have done some basic cleaning and quality checks that the records from the schedule I have match the official NCAA site – but there are thousands of games, and so I will not make the claim that I have checked every piece of the dataset. To be honest, I simply check to make sure the records match – I figure if I can get lucky enough that all 347 teams have the correct records, the rest of the data is probably right.
That being said, one really interesting thing that this file does create is a side-by-side comparison of the old RPI calculation (which my tool still calculates – as do some other webpages) vs. the new NET model that the Selection Committee is using to sort games into the quadrants. I do think that the NET score is giving the Selection Committee a better ranking, even if I still don’t understand it.
My example for the year is Rutgers. I get that they had a tough November, but let’s look at records by quad. Rutgers is 6-5 vs Quad 1, 3-4 vs Quad 2, 4-2 vs Quad 3, and 5-1 vs Quad 4. That leaves them in 76th place. Let’s compare against Ohio State. The Buckeyes are 5-5 vs Quad 1, 5-4 vs Quad 2, 6-1 vs Quad 3 and 3-0 vs Quad 4. Both teams finished 12-8 in the Big 10. I get that Rutgers has those 2 extra bad losses from November. But it is hard for me to believe that Ohio State is the 22nd best team in the country, and Rutgers with the same Big 10 conference record is 76th. Regardless, I still have to think that the NET score is better than the old RPI rankings.
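If you want to see that comparison as plain arithmetic, here is a small Python sketch that adds up the quad records quoted above. The quad records come straight from this post. The `summarize` helper and the choice to treat Quad 3 and Quad 4 losses as "bad losses" are my own assumptions for illustration, and none of this is how NET itself is computed.

```python
# Quad records quoted in this post, as (wins, losses) for Quads 1 through 4.
quad_records = {
    "Rutgers": [(6, 5), (3, 4), (4, 2), (5, 1)],
    "Ohio State": [(5, 5), (5, 4), (6, 1), (3, 0)],
}


def summarize(quads):
    """Roll four (wins, losses) pairs up into a few headline numbers."""
    wins = sum(w for w, _ in quads)
    losses = sum(l for _, l in quads)
    return {
        "total": (wins, losses),
        # Combined record against Quad 1 and Quad 2 opponents.
        "quad_1_2": (quads[0][0] + quads[1][0], quads[0][1] + quads[1][1]),
        # Assumption: a "bad loss" means any loss to a Quad 3 or Quad 4 team.
        "bad_losses": quads[2][1] + quads[3][1],
    }


for team, quads in quad_records.items():
    s = summarize(quads)
    print(
        f"{team}: {s['total'][0]}-{s['total'][1]} overall, "
        f"{s['quad_1_2'][0]}-{s['quad_1_2'][1]} vs Quads 1-2, "
        f"{s['bad_losses']} bad losses"
    )
```

Worked through by hand, that is 18-12 with a 9-9 mark vs Quads 1-2 and 3 bad losses for Rutgers, against 19-10 with a 10-9 mark and 1 bad loss for Ohio State. Those totals only count the games that appear in the quad records.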
For those of you who are not familiar with this tradition of me doing insane data pulls to grab all this great college basketball data, I will give you some more details.
As many of you know, one of my insane features is that I try to provide people with data about the teams in case they want to do their own research. Each year, we get several people who demonstrate the power of statistics by building models to predict the games. Some of them have been extremely successful with this – especially Bill Kahn with his Bradley-Terry models, showing that even something as unpredictable as sports can be forecast through good statistical techniques. But the part of this that has made me happy – and why I do this – is that a few people who were not statisticians but were taking a stats training course at work used this data for their class project and ended up having some success – including our 2006 champion, David Shaddick.
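For anyone who has not met a Bradley-Terry model before, here is a bare-bones Python sketch of the idea. It is not Bill Kahn's model, which I can't speak for. It is just the textbook version, where each team gets a strength and the chance that one team beats another is its strength divided by the sum of the two strengths. The fit uses the standard iterative update for this model, the game results are made up, and the function names are hypothetical.

```python
def fit_bradley_terry(games, iterations=200):
    """Fit team strengths from a list of (winner, loser) results.

    Returns {team: strength}, with strengths scaled to sum to 1.
    Assumes every team has at least one win and one loss, and that no
    team is listed as playing itself.
    """
    teams = sorted({team for game in games for team in game})
    wins = {team: 0 for team in teams}
    games_between = {}  # frozenset({a, b}) -> number of games played
    for winner, loser in games:
        wins[winner] += 1
        pair = frozenset((winner, loser))
        games_between[pair] = games_between.get(pair, 0) + 1

    strength = {team: 1.0 / len(teams) for team in teams}
    for _ in range(iterations):
        updated = {}
        for team in teams:
            denom = 0.0
            for pair, n in games_between.items():
                if team in pair:
                    (other,) = pair - {team}
                    denom += n / (strength[team] + strength[other])
            # Standard update: wins divided by the strength-weighted schedule.
            updated[team] = wins[team] / denom
        total = sum(updated.values())
        strength = {team: s / total for team, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Modeled probability that team a beats team b."""
    return strength[a] / (strength[a] + strength[b])


# Made-up results as (winner, loser), purely for illustration.
results = [
    ("Team A", "Team B"), ("Team A", "Team B"), ("Team B", "Team A"),
    ("Team B", "Team C"), ("Team B", "Team C"), ("Team C", "Team B"),
    ("Team A", "Team C"), ("Team C", "Team A"),
]

strengths = fit_bradley_terry(results)
for team, s in sorted(strengths.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {s:.3f}")
print(f"P(Team A beats Team C) = {win_probability(strengths, 'Team A', 'Team C'):.2f}")
```

A real model for the tournament would need a lot more than this (home court, margin of victory, teams that never lose), but the sketch shows why the full schedule with scores is the raw material for this kind of work.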
So, since that point, I have provided the scores to everyone to give people as much of a chance as possible to leverage data in making their decisions. I realize that most of you will probably spend three to five minutes just looking at the teams and figuring out who will do best – I probably don’t need a model to decide that the number 1 seeds will beat the 16 seeds… In fact, I typically spend so much effort maintaining the site that I pick Purdue to go far and just randomly pick the other games late Wednesday evening.
However, if I can give people a chance to try to learn something about statistics in a very fun environment, it is well worth the effort.
If you notice something terribly wrong, let me know – no promises I will have time to fix it, but at least everyone will know.
Enjoy the data!!!!