Modelers – the 2019 Schedule Data is Available


Well, it is ready through Sunday’s (March 10th) games! For the most part, the file is the same as normal. I have continued with the approach that conference tournament games are counted as conference games instead of post-season games. There is one additional data point – while I have not calculated it myself, I used the NCAA’s new NET rankings to validate the records (since they list the official neutral-site game records). And since I had the new ranking on the file, I added it to the standings file.

I have also added two PDF files for those of you who would like to see some of the data sheets that the Selection Committee gets when making its decisions. Fortunately for all of us, the NCAA puts its Team Sheets (which break each team’s schedule into the different ranking quadrants) and NET Nitty Gritty summary files on its RPI Archives Page – and so I have copied them and loaded them to the Research tab along with the Schedule 2019 Excel document.

No promises that I will update these three files every day, but I wanted to make them available for everyone, and I will update them as I have time throughout the week. The spreadsheet has a page that says when it was last updated.

For those of you who want to do crazy statistical research, build models, or just have all the schedule data at your fingertips to evaluate teams, the data is there in the research links under Schedule 2019.

Obviously, remember the traditional Lunatic disclaimers. I have done some basic cleaning and quality checks to confirm that the records from my schedule match the official NCAA site – but there are a lot of games, and so I will not make the claim that I have checked every piece of the dataset. More importantly, because of multiple changes to the NCAA’s website (and my ramblings last week about the difficulties of pulling this data due to blanks in the box scores), there are potential issues to be checked. For example, I have noticed that some of the home sites seem to be attendance figures, due to missing information on the box score summary. I suspect the scores are right since the complete records are correct. But take the data with a grain of salt.
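If you want to run the same sort of sanity checks yourself, here is a minimal Python sketch. It assumes you have already loaded the schedule into a list of dictionaries; the field names (team, team_score, opp_score, site) and the sample rows are made up for illustration and are not the actual column names in the spreadsheet.

```python
from collections import defaultdict


def looks_like_attendance(site):
    """Return True when a 'site' value is only digits (commas allowed)."""
    text = str(site).strip().replace(",", "")
    return text.isdigit()


def win_loss_records(games):
    """Tally (wins, losses) per team from a list of game dictionaries."""
    tallies = defaultdict(lambda: [0, 0])
    for game in games:
        if game["team_score"] > game["opp_score"]:
            tallies[game["team"]][0] += 1
        elif game["team_score"] < game["opp_score"]:
            tallies[game["team"]][1] += 1
    return {team: tuple(wl) for team, wl in tallies.items()}


# Made-up rows, purely for illustration (hypothetical field names).
games = [
    {"team": "Team A", "team_score": 70, "opp_score": 65, "site": "Home Arena"},
    {"team": "Team A", "team_score": 60, "opp_score": 72, "site": "14,804"},
]

# Rows whose site is probably an attendance figure that slipped in.
suspect_rows = [g for g in games if looks_like_attendance(g["site"])]

# Records to compare by hand against the official NCAA listing.
records = win_loss_records(games)
```

Anything that lands in suspect_rows is worth a look before you use the site column to decide home, away, or neutral.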

That being said, one really interesting thing that this file does create is a side-by-side comparison of the old RPI calculation (which my tool still calculates – as do some other webpages) vs. the new NET model that the Selection Committee is using to sort games into the quadrants. I will probably have to ramble about it later – but let’s just say that, from a quick glance, North Carolina State and Indiana are thanking their lucky stars that the NCAA has moved to the NET score, and Arizona State, Seton Hall and Temple might eventually be wishing that the RPI was still the NCAA’s ranking system.
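For anyone who wants to rank the winners and losers of the switch, a rough sketch of the comparison follows. It assumes you have pulled each team’s RPI rank and NET rank out of the standings file into a dictionary; the function name and the placeholder teams and ranks are invented for illustration, not real 2019 values.

```python
def rank_gaps(rankings):
    """Sort teams by how much better they rank under NET than under RPI.

    rankings maps team -> (rpi_rank, net_rank).
    gap = rpi_rank - net_rank, so a positive gap means NET is kinder.
    """
    gaps = [(team, rpi - net) for team, (rpi, net) in rankings.items()]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Placeholder ranks, not actual data.
example = {
    "Team A": (90, 30),   # much better under NET
    "Team B": (40, 42),   # about the same
    "Team C": (35, 70),   # much better under RPI
}

ordered = rank_gaps(example)
```

The top of the list shows the teams helped most by NET, and the bottom shows the teams that would have preferred the RPI.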

For those of you who are not familiar with this tradition of me doing insane data pulls to grab all this great college basketball data, I will give you some more details.

As many of you know, one of my insane features is that I try to provide people with data about the teams in case they want to do research on them. Each year, we get several people who demonstrate the power of statistics by building models to predict the games. Some of them have been extremely successful with this – especially Bill Kahn with his Bradley-Terry models, showing that even something as unpredictable as sports can be forecast through good statistical techniques. But the part of this that has made me happy – and why I do this – is that a few people who were not statisticians but were taking a stats training course at work used this data for their class project and ended up having some success – including our 2006 champion, David Shaddick.
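If you have never seen a Bradley-Terry model, the idea is that each team gets a strength number, and the chance that A beats B is A’s strength divided by the sum of the two strengths. Below is a bare-bones Python sketch that fits those strengths from a list of (winner, loser) results using the standard iterative update. This is only an illustration under my own assumptions (function names, made-up results, no home-court adjustment) and is not a description of how Bill builds his models.

```python
def fit_bradley_terry(games, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns {team: strength}, with strengths normalized to sum to 1.
    """
    teams = sorted({team for game in games for team in game})
    wins = {team: 0 for team in teams}
    meetings = {}  # (a, b) with a < b -> number of games between them
    for winner, loser in games:
        wins[winner] += 1
        pair = tuple(sorted((winner, loser)))
        meetings[pair] = meetings.get(pair, 0) + 1

    strength = {team: 1.0 / len(teams) for team in teams}
    for _ in range(iterations):
        updated = {}
        for team in teams:
            # Sum over opponents of games played / combined strength.
            denom = sum(
                n / (strength[a] + strength[b])
                for (a, b), n in meetings.items()
                if team in (a, b)
            )
            updated[team] = wins[team] / denom
        total = sum(updated.values())
        strength = {team: s / total for team, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Model's estimate of the chance that team a beats team b."""
    return strength[a] / (strength[a] + strength[b])


# Made-up results: A goes 3-2, B goes 2-2, C goes 1-2.
results = [("A", "B"), ("A", "B"), ("B", "A"),
           ("B", "C"), ("C", "A"), ("A", "C")]
strengths = fit_bradley_terry(results)
```

One caution: a team that is winless gets a strength of zero (and an undefeated team swallows almost all of it), so the estimates only behave when every team has both wins and losses in the data you feed in.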

So, since that point, I have provided the scores to everyone, to give people as much of a chance as possible to leverage data in making their decisions. I realize that most of you will probably spend three to five minutes just looking at the teams and figuring out who will do best – I probably don’t need a model to decide that the number 1 seeds will beat the 16 seeds… In fact, I typically spend so much effort maintaining the site that I pick Purdue to go far and just randomly pick the other games late Wednesday evening.

However, if I can give people a chance to try to learn something about statistics in a very fun environment, it is well worth the effort.

If you notice something terribly wrong, let me know – no promises that I will have time to fix it, but at least everyone will know.

Enjoy the data!!!!

