Saturday, November 17, 2012

The Prediction Variables

Predicting is a risky business.  Lots of people are wrong, and often not even remotely close.  Some people are dead on.  Very few are good enough to make a living at it, by writing or other means.  The difficulty in using regression to predict things in sports is that you typically have little or no information to work with inside a single season.  In politics, you can have hundreds of polls showing real people's tendencies.  In high school basketball you have 18 games and 9 conferences (or leagues).  The chance that everyone in a sectional bracket will have played each other is zero, and you'll be lucky to get even one first round game where the two teams have met in the regular season.

As I previously mentioned, there are almost no advanced statistical metrics, and even if there were, these are high school athletes, not professionals.  Inconsistent play is inevitable, and trying to figure out how a team will play on a given night against an unknown opponent is nearly impossible.  That's not to say all hope is gone.  I already have a model that will rank the teams by their propensity to win sectionals.  By design it's a conservative model, using only standard statistical variables that can be applied to any team under any circumstance to arrive at a fair representation of its standing.  But if I want to predict winners, I need to be a little more aggressive with my variables.

As anyone who has paid any attention to Section 2 basketball recently can tell you, CBA is pretty good year to year.  They have won 7 of the 9 AA sectionals since the section changed from a 4 class system to a 5 class system.  In the other two they lost in the finals, both times to Bishop Maginn.  If you missed that, it means all nine AA champions have come from the Big 10.  Of the 18 teams to play for the AA championship in Section 2, 15 (83%) have come from the Big 10.  So some value can be taken from past sectional performance, and that is where I went for my alternative prediction variable.  For this model I'll be replacing the opponents' opponents variable with a historical sectional winning percentage variable.

There are dangers in using this type of variable.  Most teams just haven't played enough sectional games to make it valuable or predictive.  After all, if you lose in the first round you only play one game, and it's a loss.  To get back to .500 you have to win 2 games the next year, which only 4 teams in each class do each year.  Teams can also switch classes from year to year.  Over the past 5 seasons Queensbury has moved between AA and A, and has a .250 winning percentage in AA and a .500 in A.  Using their overall winning percentage wouldn't be accurate for either class since they are playing different competition, but they've only played 4 games in each class, which isn't really enough to draw any statistically meaningful conclusion from.
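To make the Queensbury point concrete, here is a small sketch of the arithmetic.  The win totals are not quoted in the post; they are simply what .250 and .500 over 4 games each imply, and the function name is my own.

```python
def win_pct(wins, games):
    """Plain winning percentage: wins divided by games played."""
    return wins / games

# Records implied by the percentages quoted above (4 games in each class).
aa_wins, aa_games = 1, 4   # .250 in AA
a_wins, a_games = 2, 4     # .500 in A

# Pooling both classes gives a number that matches neither one.
pooled = win_pct(aa_wins + a_wins, aa_games + a_games)

print(f"AA {win_pct(aa_wins, aa_games):.3f}  "
      f"A {win_pct(a_wins, a_games):.3f}  "
      f"pooled {pooled:.3f}")
```

The pooled figure lands between the two class figures, and with only 8 games behind it a single result either way moves it by more than a hundred points.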

There is also a problem with turnover.  Teams typically turn over their entire roster every two years.  Coaching can account for some of the level of performance, but in most cases not enough to outweigh a roster that changes every year.  This is why I've decided to use the performance of each team's conference in sectionals, counting a bye as a win and a team's absence from sectionals as a loss.  This method isn't without flaws either, which I'll get into, but for the moment I'll focus on its positives.  In AA, for example, the Big 10 has a .595 winning percentage over the past 5 years and the Suburban only .308.  Giving each team in the Big 10 the .595 winning percentage for that variable will increase their odds of winning in the regression model.  There are also significantly more games, so the numbers are more reliable.  The Big 10 has played 84 games and the Suburban 65, whereas the most games played by any individual team is 20, and only 4 of the 19 teams that have played in the AA bracket have played at least 10.
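Here is a rough sketch of how that conference variable could be tallied, with a bye scored as a win and a missed sectional scored as one loss.  The record layout, the names, and the sample rows are all made up for illustration; they are not my actual data or real Section 2 results.

```python
from dataclasses import dataclass

@dataclass
class SectionalSeason:
    """One team's sectional results for one season (hypothetical layout)."""
    conference: str
    wins: int = 0
    losses: int = 0
    byes: int = 0               # each bye is scored as a win
    participated: bool = True   # False is scored as a single loss

def conference_win_pct(seasons):
    """Return {conference: (wins, games, pct)} under the bye/absence rule."""
    totals = {}
    for s in seasons:
        wins, games = totals.get(s.conference, (0, 0))
        if s.participated:
            wins += s.wins + s.byes
            games += s.wins + s.losses + s.byes
        else:
            games += 1  # missing sectionals counts as one loss
        totals[s.conference] = (wins, games)
    return {c: (w, g, w / g) for c, (w, g) in totals.items() if g > 0}

# Made-up rows for illustration only.
sample = [
    SectionalSeason("Conf X", wins=3, losses=1),
    SectionalSeason("Conf X", wins=1, losses=1, byes=1),
    SectionalSeason("Conf X", participated=False),
    SectionalSeason("Conf Y", wins=0, losses=1),
]
result = conference_win_pct(sample)
```

Every team in a conference would then get its conference's percentage as the value of the variable.  As a sanity check on the figures above, .595 over 84 games works out to 50 wins and .308 over 65 games to 20 wins.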

So what are the negatives?  For starters, if you exclude CBA and Bishop Maginn, the Big 10 only wins at a .383 clip.  That's still better than the Suburban, but not significantly.  The other big negative is a trend I've noticed where conferences go through cycles, stronger in some stretches than in others.  When I first started watching Section 2 basketball, the Patroon Conference hadn't won a championship in any class in over a decade (I think the last one was 1986, but don't quote me).  They've won 3 in the past 5 years.  There was one season in the late '90s when the CHVL as a league won only 2 non-league games the entire season.  Last year in sectionals they won 3 of 5.

If I'm sounding a bit pessimistic, it's because I am.  I ran a model similar to this one about ten years ago and it performed pretty well, better than the committee did, but I have never run this one specifically.  I'm pessimistic by nature, so don't let that get in the way.  Anytime you put your name on a prediction you're taking a risk, and I'll be doing the same come mid February.  My fingers are crossed and I'm knocking on wood.  Let the games begin.
