Saturday, November 10, 2012

The Variables of Sectional Forecasting, part 2

I've already established how winning percentage is a valuable tool in determining sectional winners and why it is the first variable in my equation.  I've also discussed the second variable, the variance between a team's actual winning percentage and the winning percentage it would be expected to have based on its enrollment and that of its opponents.  When I first started working on this project I looked to college basketball's RPI.  The RPI is a straightforward calculation using 25% of your winning percentage, 50% of your opponents' winning percentage and 25% of your opponents' opponents' winning percentage.  While it is fairly easy to calculate, most people agree it's also not overly effective.
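To make the RPI weighting concrete, here is a minimal sketch in Python that follows the simplified 25/50/25 description above.  The function name and the sample numbers are made up for illustration; they are not taken from any real team.

```python
def rpi(win_pct, opp_win_pct, opp_opp_win_pct):
    """Simplified RPI: 25% own, 50% opponents', 25% opponents' opponents' winning percentage.

    All three inputs are winning percentages expressed as fractions (0.0 to 1.0).
    """
    return 0.25 * win_pct + 0.50 * opp_win_pct + 0.25 * opp_opp_win_pct


# Hypothetical team: wins 80% of its games, its opponents win 55%,
# and its opponents' opponents win 50%.
example_rpi = rpi(0.800, 0.550, 0.500)
print(round(example_rpi, 3))
```

Notice that half of the rating comes from how your opponents did, not from how you did.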

My original model was designed using the same three variables, only I used regression to figure out what those percentages should be instead of just assuming I knew the answer.  What I found was that even though it does a pretty good job of predicting winners, the last variable, your opponents' opponents' winning percentage, didn't add very much value to the formula.  Your opponents' winning percentage, while not adding as much value as your own, did offer some, so I kept it but modified it slightly, as I did with my second variable.  The theory is the same: what is the winning percentage of your opponents compared to what one would statistically expect given enrollment factors?
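For anyone curious what "using regression to figure out the percentages" can look like, here is a rough sketch in plain Python.  It is an ordinary least squares fit with no intercept, which is an assumption on my part and not necessarily the exact regression behind the model.  The function names and the data are hypothetical: the outcomes are built from the RPI weights on purpose, just to check that the fit recovers them.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("variables are collinear; the weights are not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_weights(rows, outcomes):
    """Least squares weights (no intercept) via the normal equations.

    rows:     one list of variable values per team
    outcomes: the result being predicted for each team
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcomes)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic check: outcomes generated from weights of 0.25, 0.50 and 0.25.
rows = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
outcomes = [0.25, 0.50, 0.25, 1.00]
weights = fit_weights(rows, outcomes)
print([round(w, 3) for w in weights])
```

With real data the rows would be each team's variables and the outcomes would be some measure of how the team did in sectionals.  A weight that comes back close to zero is the kind of signal that a variable isn't adding much.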

If you have looked at schedules at all over the past 10 years, you will notice that certain teams play different types of opponents in their non-league schedules.  Schuylerville has for a while now played teams with much larger enrollments and has even played Catholic Central a few times.  Other teams in the Wasaren League take a different tack and play teams that are closer in proximity but smaller in size.  For example, Cambridge may play Salem instead of Schoharie.  When Schuylerville chooses to play larger schools it does give them a bonus in the model, but the third variable is meant to offer further perspective.  Let's say Stillwater plays Hudson Falls.  Hudson Falls has an enrollment advantage of about 200 students, but Stillwater wins the game.  Our second variable gives Stillwater extra points for beating a team they probably shouldn't have based on enrollment.  But when you look further, you realize Hudson Falls only won 1 game all year and significantly underperformed their schedule.  The third variable acts as an adjustment to the second, giving you credit when you beat a team you shouldn't and blame when you lose to a team you should beat.
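Here is a small sketch of the idea behind the third variable.  It assumes the adjustment is the average gap between each opponent's actual winning percentage and its enrollment-based expected winning percentage; the averaging, the function names and all of the numbers are assumptions for illustration only, not the exact calculation in the model.

```python
def variance(actual_win_pct, expected_win_pct):
    """How far a team finished above (+) or below (-) its expected winning percentage."""
    return actual_win_pct - expected_win_pct


def opponents_variance(opponents):
    """Average variance across a team's opponents.

    opponents: list of (actual_win_pct, expected_win_pct) pairs, one per opponent.
    """
    diffs = [variance(actual, expected) for actual, expected in opponents]
    return sum(diffs) / len(diffs)


# Hypothetical schedule: one larger opponent that badly underperformed
# (expected .550, actually .050) and one opponent that slightly overperformed.
schedule = [(0.05, 0.55), (0.60, 0.50)]
adjustment = opponents_variance(schedule)
print(round(adjustment, 3))
```

A negative result says the opponents were weaker than their enrollments suggest, so the bonus earned from the second variable gets pulled back.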

The final variable is based on points.  One difficulty in building a model like this is that there isn't a ton of data to use.  Not every school calls in all of its games, and scores may differ depending on the source you're reading or hearing them from.  There is little in the way of advanced metrics like the ones available for the NBA or Major League Baseball.  Points, however, should have some value in determining a team's level of play.  Continuing with the theme of having all the variables in terms of winning percentage, I had to develop a method to turn Point Margin into a winning percentage.

If you assume that a .500 team averages a zero Point Margin per Game (PMPG), you can extrapolate that data out in both directions until you reach the max and min of PMPG.  To determine if that assumption is accurate, I calculated the PMPG of every team in each of the past three years.  It turns out it's not zero, but it's really close: -0.6, less than one point away.  So now, at any time, I can tell you how many games you should have won by figuring out your PMPG and comparing that to historical standards.
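One way to picture the conversion is a straight line on each side of the .500 point.  The sketch below assumes a piecewise linear scale where the historical minimum PMPG maps to .000, the .500 point sits at -0.6, and the historical maximum maps to 1.000.  The linear shape, the function names and the plus or minus 30 range are assumptions for illustration; only the -0.6 comes from the data above.

```python
def pmpg_to_win_pct(pmpg, min_pmpg, max_pmpg, even_pmpg=-0.6):
    """Map Point Margin per Game to an expected winning percentage (0.0 to 1.0).

    min_pmpg and max_pmpg are the historical extremes (hypothetical here);
    even_pmpg is the PMPG of a .500 team.
    """
    if not min_pmpg < even_pmpg < max_pmpg:
        raise ValueError("need min_pmpg < even_pmpg < max_pmpg")
    # Clamp so values beyond the historical range cannot go past 0% or 100%.
    pmpg = max(min_pmpg, min(max_pmpg, pmpg))
    if pmpg >= even_pmpg:
        return 0.5 + 0.5 * (pmpg - even_pmpg) / (max_pmpg - even_pmpg)
    return 0.5 - 0.5 * (even_pmpg - pmpg) / (even_pmpg - min_pmpg)


def expected_wins(pmpg, games, min_pmpg, max_pmpg):
    """Number of games a team 'should' have won given its PMPG."""
    return games * pmpg_to_win_pct(pmpg, min_pmpg, max_pmpg)


# Hypothetical team outscoring opponents by 14.7 points per game over 20 games,
# with an assumed historical range of -30 to +30.
print(round(pmpg_to_win_pct(14.7, -30, 30), 3))
print(round(expected_wins(14.7, 20, -30, 30), 1))
```

Comparing the expected wins to the actual wins shows whether a team's record is ahead of or behind what its scoring margin suggests.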

I would like to stress again that this model is only for determining sectional seeding and how one team compares to another based on the four variables in the model and how those variables weigh in on how well a team does in sectionals.  As the season goes on and more game information is available, the model will become more accurate, which is why I'm waiting until there are at least 5 games.  I've only ever run one of these models at season's end, so I'm really not sure what it's going to look like as we go.  I would also like to stress that I only have 3 full seasons of information, essentially only 3 data sets (15 if you count each class as a set).  That is really not very many, and with each year the model should become more accurate and reliable.

As of now I have mainly focused on boys' basketball.  I will be working on the girls' side as well, but since I only have one full season for them, it's still a work in progress.  Up next I'll handle the difficulty of dealing with Parochial and Charter schools and how to use their enrollment in the model.
