Modelers – the 2016 schedule file is ready


Well, it is ready through Thursday’s games! For the most part, the file is the same as in past years, with maybe a couple of extra fields of information. The biggest data difference is that I noticed the NCAA RPI pages classify the conference tournament games as conference games instead of post-season games (which makes perfect sense when you think about how they use the non-conference strength of schedule and record).

So, I have not gone to the trouble of flagging those games with a 2 to mark them as tournament games.

But for those of you who want to do crazy statistical research, build models, or just have all the schedule data at your fingertips to evaluate teams, the data is there in the research links under 2016 Schedule.

Obviously, remember the traditional Lunatic disclaimers.  I have done some basic cleaning and quality checks against RPI data – but there are a lot of games, and so I will not make the claim that I have checked every piece of the dataset.

For those of you who are not familiar with this tradition, I will give you some more details.

As many of you know, one of my insane features is that I try to provide people with data about the teams in case they want to do research on them. Each year, we get several people who demonstrate the power of statistics by building models to predict the games. Some of them have been extremely successful with this – especially Bill Kahn with his Bradley-Terry models, showing that even something as unpredictable as sports can be forecast through good statistical techniques. But the part of this that has made me happy – and the reason I do this – is that a few people who were not statisticians but were taking a stats training course at work used this data for their class project and ended up having some success – including our 2006 champion, David Shaddick.
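For anyone curious what a Bradley-Terry model looks like in practice, here is a minimal sketch in Python. To be clear, this is not Bill's actual model, and it does not read the schedule file: the list of (winner, loser) pairs, the team names, and the function names `fit_bradley_terry` and `win_probability` are all made up for illustration. It uses the standard iterative update for Bradley-Terry strengths and assumes every team has at least one win and that the teams are connected through their schedules.

```python
from collections import defaultdict


def fit_bradley_terry(games, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Assumes every team has at least one win; a winless team would get
    a strength of zero and the update below would break down.
    """
    wins = defaultdict(int)         # total wins per team
    pair_counts = defaultdict(int)  # games played between each pair
    teams = set()
    for winner, loser in games:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        teams.update((winner, loser))

    strength = {team: 1.0 for team in teams}
    for _ in range(iterations):
        updated = {}
        for team in teams:
            denom = 0.0
            for pair, n_games in pair_counts.items():
                if team in pair:
                    (other,) = pair - {team}
                    denom += n_games / (strength[team] + strength[other])
            updated[team] = wins[team] / denom
        # Rescale so the strengths average to 1 (only ratios matter).
        total = sum(updated.values())
        strength = {t: s * len(teams) / total for t, s in updated.items()}
    return strength


def win_probability(strength, team_a, team_b):
    """Modeled probability that team_a beats team_b."""
    return strength[team_a] / (strength[team_a] + strength[team_b])


# Hypothetical toy results, not taken from the schedule file.
toy_games = (
    [("A", "B")] * 2 + [("B", "A")]
    + [("B", "C")] * 2 + [("C", "B")]
    + [("A", "C")] * 2 + [("C", "A")]
)
toy_strength = fit_bradley_terry(toy_games)
print(sorted(toy_strength, key=toy_strength.get, reverse=True))
print(round(win_probability(toy_strength, "A", "C"), 3))
```

With real data you would build the list of pairs from the game results in the file, and you would probably want to handle winless teams and home-court effects, which this sketch ignores.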

So, since that point, I have provided the scores to everyone to give people as much of a chance as possible to use data in making their picks. I realize that most of you will probably spend three to five minutes just looking at the teams and figuring out who will do best – I probably don’t need a model to decide that the number 1 seeds will beat the 16 seeds… In fact, I typically spend so much effort maintaining the site that I just pick randomly late Wednesday evening.

However, if I can give people a chance to try to learn something about statistics in a very fun environment, it is well worth the effort.

If you notice something terribly wrong, let me know – no promises that I will have time to fix it, but at least everyone will know.

Enjoy the data!!!!

