For those of you who, like me, love your data almost as much as college basketball, you have probably noticed that the 2023 Schedule link is already available on the site. This is our traditional Excel spreadsheet, with the standings and rankings in one worksheet and the full NCAA schedule with scores in another. I have continued with the approach that conference tournament games are counted as conference games instead of post-season games. As we did last year, I have merged in the NET rankings from the NCAA website so the spreadsheet has the updated rankings, along with a listing of records for the 4 quads. I have then used that information to validate that my records are correct. It is a relatively simple check that matches the overall, home, away and neutral court records between what the schedule data says and what was in the NET rankings, so take the data for what it is.
The NCAA Archive site used to have this great document called Team Pages – it had a page for each team with their computer rankings (including NET, KenPom and Sagarin scores) and then their full schedule listed by quad. Unfortunately, I have not found anything like that, so I am going to let you simply use the Excel document that I have created if you want something similar to the old team pages.
I will also give you a link to the NCAA Statistics site – this is where I pull all my data from, and it has some nicer views if you want to simply look at the data in a web browser instead of Excel. Unfortunately, it doesn’t include other analytical rankings such as the KenPom, Sagarin, and BPI scores. But as I have mentioned, things are a little crazy for the Lunatic, and so we will likely stick with this. If I come up with a better option, I will provide it, but most likely, what you see right now is what you will get.
But that does have the benefit that there is now only one file to update. No promises that I will update the Schedule Excel file every day, but I wanted it to be available to everyone as early as possible. I will update it as I have time throughout the week – the spreadsheet has a page that says when it was last updated.
Obviously, remember the traditional Lunatic disclaimers. I have done some basic cleaning and quality checks to confirm that the records from my schedule match the official NCAA site – but there are thousands of games, and so I will not make the claim that I have checked every piece of the dataset. To be honest, I simply check to make sure the records match – I figure if I can get lucky enough that all 363 teams have the correct records, the rest of the data is probably right.
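For anyone who wants to repeat that kind of check on their own copy of the data, here is a minimal sketch of the idea. It is not the actual process used to build the spreadsheet: the tuple layout, the function names, and the tiny made-up game list are all assumptions for illustration, and it only compares overall records (the same tally could be split by home, away and neutral site).

```python
from collections import defaultdict

def records_from_schedule(games):
    """Tally (wins, losses) per team from finished games.

    Each game is a hypothetical (team_a, score_a, team_b, score_b) tuple.
    """
    tally = defaultdict(lambda: [0, 0])
    for team_a, score_a, team_b, score_b in games:
        if score_a == score_b:
            continue  # ties should not happen in basketball; skip defensively
        if score_a > score_b:
            winner, loser = team_a, team_b
        else:
            winner, loser = team_b, team_a
        tally[winner][0] += 1
        tally[loser][1] += 1
    return {team: (w, l) for team, (w, l) in tally.items()}

def find_mismatches(games, official):
    """Return {team: (computed, official)} for every team whose records differ."""
    computed = records_from_schedule(games)
    mismatches = {}
    for team, official_record in official.items():
        mine = computed.get(team, (0, 0))
        if mine != official_record:
            mismatches[team] = (mine, official_record)
    return mismatches

# Made-up scores, purely to exercise the functions.
games = [
    ("Purdue", 80, "Indiana", 70),
    ("Indiana", 65, "Iowa", 60),
    ("Iowa", 75, "Purdue", 72),
]
# The Iowa entry is deliberately wrong so the check has something to flag.
official = {"Purdue": (1, 1), "Indiana": (1, 1), "Iowa": (2, 0)}
print(find_mismatches(games, official))
```

An empty result means every listed team's record agrees with the official one, which is the "all the records match" condition described above.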
That being said, one really interesting thing that this file does create is a side-by-side comparison of the old RPI calculation (which my tool still calculates – as do some other webpages) vs. the new NET model that the Selection Committee is using to rank games into the quadrants. I do think that the NET score is giving the Selection Committee a better ranking, even if I still don’t understand it. I will probably blog about some of these differences later in the week.
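For readers who have never seen the RPI spelled out, the basic version is 25% a team's own winning percentage, 50% its opponents' winning percentage, and 25% its opponents' opponents' winning percentage. The sketch below implements only that basic, unweighted form. It is an assumption that this is close to what the spreadsheet does; the NCAA's later men's basketball version also weighted results by game location, which is left out here, and the function names and `(winner, loser)` layout are hypothetical.

```python
def win_pct(games, team, exclude=None):
    """Winning percentage of `team`, ignoring any games against `exclude`."""
    wins = played = 0
    for winner, loser in games:
        if team not in (winner, loser) or exclude in (winner, loser):
            continue
        played += 1
        wins += winner == team
    return wins / played if played else 0.0

def opponents(games, team):
    """One entry per game played, so repeat opponents count each time."""
    return [l if w == team else w for w, l in games if team in (w, l)]

def owp(games, team):
    """Opponents' winning percentage, excluding their games against `team`."""
    opps = opponents(games, team)
    if not opps:
        return 0.0
    return sum(win_pct(games, o, exclude=team) for o in opps) / len(opps)

def oowp(games, team):
    """Average of each opponent's own opponents' winning percentage."""
    opps = opponents(games, team)
    if not opps:
        return 0.0
    return sum(owp(games, o) for o in opps) / len(opps)

def rpi(games, team):
    """Basic RPI: 0.25 * WP + 0.50 * OWP + 0.25 * OOWP."""
    return (0.25 * win_pct(games, team)
            + 0.50 * owp(games, team)
            + 0.25 * oowp(games, team))

# Three made-up teams: A beats B and C, B beats C.
games = [("A", "B"), ("A", "C"), ("B", "C")]
for team in ("A", "B", "C"):
    print(team, round(rpi(games, team), 3))
```

Note that half of the number comes from how a team's opponents did rather than from the team itself, which is one reason RPI and NET can disagree noticeably on the same schedule.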
For those of you who are not familiar with this tradition of me doing insane data pulls to grab all this great college basketball data, I will give you some more details.
As many of you know, one of my insane features is that I try to provide people with data about the teams in case they want to do research on them. Each year, we get several people who demonstrate the power of statistics by building models to predict the games. Some of them have been extremely successful with this – especially Bill Kahn with his Bradley-Terry models, showing that even something as unpredictable as sports can be forecast through good statistical techniques. But the part of this that has made me happy – and the reason I do this – is that a few people who were not statisticians but were taking a stats training course at work used this data for their class project and ended up having some success – including our 2006 champion, David Shaddick.
So, since that point, I have provided the scores to everyone, to give people as much of a chance as possible to leverage data in making their decisions. I realize that most of you will probably spend three to five minutes just looking at the teams and figuring out who will do best – I probably don’t need a model to decide that the number 1 seeds will beat the 16 seeds… In fact, I typically spend so much effort maintaining the site that I pick Purdue to go far and just randomly pick the other games late Wednesday evening.
However, if I can give people a chance to try to learn something about statistics in a very fun environment, it is well worth the effort.
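If you want a starting point for that kind of learning, here is a minimal sketch of a Bradley-Terry model, the family of models mentioned above. It is not anyone's actual tournament model: the function names, the `(winner, loser)` layout and the made-up results are assumptions for illustration. The model gives each team a strength, and the chance that one team beats another is its strength divided by the sum of the two strengths.

```python
from collections import defaultdict

def bradley_terry(games, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic iterative update
        p_i = W_i / sum_j( n_ij / (p_i + p_j) )
    and renormalises each round so the strengths sum to 1.
    Every team needs at least one win and one loss for this to behave.
    """
    wins = defaultdict(int)
    meetings = defaultdict(int)  # (team, opponent) -> games played
    teams = set()
    for winner, loser in games:
        wins[winner] += 1
        meetings[(winner, loser)] += 1
        meetings[(loser, winner)] += 1
        teams.update((winner, loser))

    strength = {t: 1.0 / len(teams) for t in teams}
    for _ in range(iterations):
        updated = {}
        for t in teams:
            denom = sum(n / (strength[t] + strength[o])
                        for (a, o), n in meetings.items() if a == t)
            updated[t] = wins[t] / denom
        total = sum(updated.values())
        strength = {t: v / total for t, v in updated.items()}
    return strength

def win_probability(strength, a, b):
    """Modelled probability that team `a` beats team `b`."""
    return strength[a] / (strength[a] + strength[b])

# Made-up results: each pair meets three times.
games = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("A", "C"), ("A", "C"), ("C", "A"),
]
strength = bradley_terry(games)
print(sorted(strength, key=strength.get, reverse=True))
print(round(win_probability(strength, "A", "C"), 3))
```

With real schedule data you would feed in every game as a `(winner, loser)` pair. Undefeated or winless teams need special handling, since their estimated strengths run off toward the extremes.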
If you notice something terribly wrong, let me know – no promises that I will have time to fix it, but at least everyone will know.
Enjoy the data!!!!