Methodology

"Let's start with something simple, like one and one ain't three."  Jimmy Buffett had it right when he wrote those lyrics.  If you're going to try and figure out what variables are most important in figuring out anything you should probably start with something simple.  Winning percentage is the quickest and easiest way to figure out who is good and who isn't.  The Section 2 committee follows this number pretty closely in their seeding order.  But, just because it's quick and easy doesn't mean it's always right and just because it's not always right doesn't mean it's not valuable in predicting sectional winners.  It is also why it is my first variable.

Over the past three seasons, 8 of 15 teams with the best winning percentage in their class won sectionals.  It's by no means perfect, but if I can tell you that 53% of teams win based on one variable I would have a pretty good starting point for my statistical model.  What I have tried to do with this model is to use winning percentage as my basis and use other variables to help explain why 47% of teams with the best record don't win sectionals. 

A lot of the theory behind this model has been based on observations over the years watching high school basketball.  Some of it comes from intuition and some from trial and error.  One thing I had always assumed was that larger schools were incrementally better than smaller schools.  If you have 1,000 students to choose from to form a basketball team, you will probably have better odds of finding more talent than if you have 100 students.  The NYSPHSAA apparently feels the same way, as they have split their schools into five classes based on enrollment.  Large schools compete against large schools, small against small.  The question I always had was whether it really did matter and, if so, how much, and how I could incorporate that into my formula.

Now I can answer those questions and incorporate the answer as my second variable.  Over the past three seasons the team with the greater enrollment has won 58% of the games.  Over an 18-game schedule that only amounts to 3 games over .500, but over an 82-game season like the NBA plays, it would be 13 games over .500 and certainly in the playoffs.  But even that doesn't tell the whole story.  Teams with a 200-student advantage win 66% of the time.  That is 6 games over .500 for a high school schedule or 27 games over .500 in the NBA, and that will have you in the hunt for a division title.  This can have a tremendous effect on a team in Section 2.  Broadalbin-Perth plays 8 conference games against teams with more than a 200-student enrollment advantage.  That means on average they are starting the season 3-5.
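The "games over .500" figures above are just a win rate applied to a schedule length.  A minimal sketch of that arithmetic (the function name is made up for illustration and is not part of the model):

```python
def games_over_500(win_rate: float, games: int) -> float:
    """Expected wins minus expected losses over a schedule of `games` games."""
    expected_wins = win_rate * games
    expected_losses = games - expected_wins
    return expected_wins - expected_losses


# Win rates quoted in the text: 58% for the larger school overall,
# 66% with a 200-student enrollment advantage.
for win_rate in (0.58, 0.66):
    for games in (18, 82):
        margin = games_over_500(win_rate, games)
        print(f"{win_rate:.0%} over {games} games: {margin:+.1f} games vs. .500")

# The smaller school's side of an 8-game slate against much larger opponents.
smaller_school_wins = (1 - 0.66) * 8
print(f"Expected wins in 8 games as the smaller school: {smaller_school_wins:.1f}")
```

With the rounded rates this gives roughly 3 and 13 games over .500 at 58%, and roughly 6 and 26 at 66%.  The small gap from the 27 quoted for the NBA case is consistent with rounding in the 66% figure.  The last line comes out to about 2.7 wins, which is the 3-5 start mentioned above.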

This is the basis for the second variable in the model.  If Broadalbin-Perth wins 4 of those 8 games they have performed better than the average team under those circumstances and have therefore increased their likelihood of winning sectionals.  I will take every team and calculate every schedule to determine what their expected winning percentage would be and compare that to how they actually performed.  That variance is the second variable.
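As a rough sketch of that calculation: give each game on the schedule a win probability based on the enrollment gap, average those to get the expected winning percentage, and subtract it from the actual one.  The function name and input shape here are hypothetical.  The only probability used is the 34% a team gets when facing a 200-student disadvantage (the flip side of the 66% quoted earlier).

```python
def schedule_variance(actual_wins: int, win_probs: list[float]) -> float:
    """Actual winning percentage minus the expected one for a schedule.

    `win_probs` holds one enrollment-based win probability per game played.
    """
    games = len(win_probs)
    if games == 0:
        raise ValueError("schedule is empty")
    expected_pct = sum(win_probs) / games
    actual_pct = actual_wins / games
    return actual_pct - expected_pct


# Broadalbin-Perth's 8 conference games against much larger schools:
# each one is treated as a 34% chance (1 - 0.66).
probs = [0.34] * 8
print(f"Variance when winning 4 of 8: {schedule_variance(4, probs):+.3f}")
```

Winning 4 of those 8 comes out to about +0.16, meaning the team finished 16 percentage points better than an average team would against that slate.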

When I first looked into working on this project I looked to college basketball's RPI.  The RPI is a straightforward calculation using 25% of your winning percentage, 50% of your opponents' winning percentage and 25% of your opponents' opponents' winning percentage.  While it is fairly easy to calculate, most people agree it's also not overly effective.
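For reference, the basic 25/50/25 weighting described above fits in a few lines.  The input numbers below are made up purely for illustration.

```python
def rpi(wp: float, owp: float, oowp: float) -> float:
    """Basic RPI: 25% own winning pct, 50% opponents', 25% opponents' opponents'."""
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp


# Hypothetical team: .750 record, opponents at .500, opponents' opponents at .500.
print(f"{rpi(0.750, 0.500, 0.500):.4f}")  # 0.5625
```

Notice that half of the number comes from a single input, the opponents' record, before anyone has checked whether that weight is justified.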

My original model was designed using the same three variables, only I used regression to figure out what those percentages should be instead of just assuming I knew the answer.  What I found was that even though it does a pretty good job of predicting winners, the last variable didn't add very much value to the formula.  Your opponents' winning percentage, while not adding as much value as your own, did offer some, so I kept it but modified it slightly, as I did with my second variable.  The theory is the same: what is the winning percentage of your opponents compared to what one would statistically expect given enrollment factors?

If you have looked at schedules at all over the past 10 years, you will notice certain teams play different types of teams in their non-league schedules.  Schuylerville has for a while now played teams with much larger enrollments and has even played Catholic Central a few times.  Other teams in the Wasaren League take a different tack and play teams that are closer in proximity but smaller in size.  For example, Cambridge may play Salem instead of Schoharie.  When Schuylerville chooses to play larger schools it does give them a bonus in the model, but the third variable is meant to offer further perspective.  Let's say Stillwater plays Hudson Falls.  Hudson Falls has about a 200-student enrollment advantage, but Stillwater wins the game.  Our second variable gives Stillwater extra points for beating a team they probably shouldn't have based on enrollment.  But when you look further you realize Hudson Falls only won 1 game all year and significantly under-performed their schedule.  The third variable acts as an adjustment to the second, giving you credit when you beat a team you shouldn't and blame when you lose to a team you should beat.

The final variable is based on points.  One difficulty in doing a model like this is that there isn't a ton of data to use.  Not every school calls in all their games, and scores may differ depending on the source you're reading or hearing them from.  There is little in the way of advanced metrics like they have in the NBA or Major League Baseball.  Points, however, should have some value in determining the level of play of a team.  Continuing with the theme of having all the variables in terms of winning percentage, I had to develop a method to turn Point Margin into a winning percentage.

If you assume that a .500 team averages a zero Point Margin per Game (PMPG), you can extrapolate out in both directions until you reach the max and min of PMPG.  To determine whether that assumption is accurate I calculated the PMPG of every team in each of the past three years.  It turns out it's not zero, but it's really close: -0.6, less than a point off.  So now, at any time, I can tell you how many games you should have won by figuring out your PMPG and comparing that to historical standards.
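One way to picture that conversion is as a straight line on each side of the .500 point: the baseline PMPG maps to .500, the historical minimum maps to .000 and the historical maximum maps to 1.000.  This is only a sketch of the idea.  The straight-line shape, the function names and the -30/+30 bounds in the example are assumptions for illustration, not figures from the database.

```python
def pmpg_to_win_pct(pmpg: float, min_pmpg: float, max_pmpg: float,
                    baseline: float = 0.0) -> float:
    """Map Point Margin per Game onto a 0.0-1.0 winning percentage.

    Assumes one straight line from (min_pmpg, 0.0) to (baseline, 0.5)
    and another from (baseline, 0.5) to (max_pmpg, 1.0).
    """
    if not min_pmpg < baseline < max_pmpg:
        raise ValueError("need min_pmpg < baseline < max_pmpg")
    pmpg = max(min_pmpg, min(max_pmpg, pmpg))  # clamp to the historical range
    if pmpg >= baseline:
        return 0.5 + 0.5 * (pmpg - baseline) / (max_pmpg - baseline)
    return 0.5 - 0.5 * (baseline - pmpg) / (baseline - min_pmpg)


def expected_wins_from_points(pmpg: float, games_played: int, min_pmpg: float,
                              max_pmpg: float, baseline: float = 0.0) -> float:
    """Games a team 'should have won' given its point margin so far."""
    return pmpg_to_win_pct(pmpg, min_pmpg, max_pmpg, baseline) * games_played


# Hypothetical historical bounds of -30 and +30 points per game.
print(pmpg_to_win_pct(15.0, -30.0, 30.0))                 # zero baseline
print(pmpg_to_win_pct(15.0, -30.0, 30.0, baseline=-0.6))  # measured -0.6 baseline
print(expected_wins_from_points(15.0, 18, -30.0, 30.0))
```

Passing `baseline=-0.6` instead of the default zero shifts the .500 point to the measured value.  The difference between the two is small, which is the point made above.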

Parochial and Charter Schools and Non-section games

I have previously mentioned that the variables in my Sectional Forecasting model put things in terms of a winning percentage, whether it be actual or expected.  I have also mentioned how the basis for two of the variables relies on a comparison of enrollments between the team in question and its opponents or its opponents' opponents.  They are all derived from a database of games played.  What I have failed to mention is that there is more than one database.  So far, the stats I have referenced have been from the Public vs. Public database because it is much larger and should be a little more stable.  There are also databases for Public vs. Private and Private vs. Private.

Before I go any further with the other databases, there is one other thing I have failed to mention.  The model does not take into account any games played against an opponent from outside of Section 2.  The reason for this is pretty straightforward: I have no information on these schools and I really just don't have the time to try and find it.  Maintaining all these databases is pretty labor intensive as it is, and we only have roughly 90 schools in Section 2.  To do it for the whole state, and in some cases Vermont and New Jersey, would just be way too time consuming.  So, just to be clear moving forward, any records I mention or stats I give are only for games against Section 2 schools unless otherwise stated.

One other note to avoid confusion: when I refer to Private schools I am referring to both Parochial and Charter Schools.  That is because both are allowed to accept applicants and aren't restricted, or are only partially restricted, by which school district those applicants come from.  It's also cleaner to use one name instead of two.  In either case, comparing their enrollments as if they were a Public school doesn't work, but when you compare them to each other you get more of an apples-to-apples comparison.  And when you compare them as a group to the Public schools you can get a bit of a read on what the relationship between their enrollments is.

First off, let's compare Private schools vs. Private schools.  If the theory holds true that two schools who live by the same rules get better as their enrollment increases, we should see an effect similar to the one in the Public vs. Public database.  In the largest Public vs. Public category, with an enrollment variance greater than 900, the larger school wins 85% of the time.  In the largest category for Private vs. Private schools, a variance of greater than 275, the larger school wins 87.5% of the time.  That is certainly comparable when you take into account the significantly smaller enrollments of the Private schools.  And the trend is comparable as well, going from a .441 winning percentage when the variance is less than 81 to .632 between 81 and 274 and up to .875 over 275.
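Bucket figures like these can be tallied straight from a list of game results.  A minimal sketch, assuming each game is stored as the two enrollments plus a flag for whether the first team won.  That record layout, the function name and the four sample games are hypothetical, and the bucket boundaries are treated as "under 81", "81 to 274" and "275 and up".

```python
from bisect import bisect_right
from collections import defaultdict


def larger_school_win_rates(games, edges):
    """Winning percentage of the larger school, grouped by enrollment gap.

    games: iterable of (enrollment_a, enrollment_b, a_won) tuples.
    edges: sorted lower bounds of every bucket after the first.
    Returns {bucket_index: larger-school winning percentage}.
    """
    tallies = defaultdict(lambda: [0, 0])  # bucket -> [larger-school wins, games]
    for enrollment_a, enrollment_b, a_won in games:
        if enrollment_a == enrollment_b:
            continue  # no "larger" school to credit
        gap = abs(enrollment_a - enrollment_b)
        larger_won = a_won if enrollment_a > enrollment_b else not a_won
        bucket = bisect_right(edges, gap)
        tallies[bucket][0] += int(larger_won)
        tallies[bucket][1] += 1
    return {b: wins / played for b, (wins, played) in sorted(tallies.items())}


# Four made-up games, just to exercise the three buckets.
sample_games = [
    (300, 250, True),   # gap 50: larger school won
    (300, 260, False),  # gap 40: larger school lost
    (400, 200, True),   # gap 200: larger school won
    (150, 500, False),  # gap 350: larger school won
]
print(larger_school_win_rates(sample_games, edges=[81, 275]))
```

The same function works for the Public vs. Public database by passing that database's own bucket edges.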

Now that we know the theory holds true, at least in Section 2 over the past three years, we can compare the Public schools to the Private schools.  When I first started trying to figure this out, I had this thought that if I could come up with a multiple for Private schools, I could then compare Public and Private on the same level.  Unfortunately it failed miserably every time.  There just isn't enough common ground between the Private schools that you can use one multiple.  I also tried using several different multiples based on factors like the size of the city the school is located in, but it became too muddled and I ended up with schools needing to change class to make it work.

But, if you separate out the Private schools and compare them as a group to the Public schools you can use that information in the same way.  You can take each game they play and get an expected winning percentage just as you do with the Public schools.  So, how exactly do they compare?  Well, not as cleanly as you might expect, but when you look more closely it does make sense.  When a Public school plays a Private school and the variance in enrollment is less than 122, the larger school wins 54% of the games.  When the variance is greater than 1899, they win 66%.  Now here is where it gets interesting, between 122 and 1899, the smaller team wins 60%. 
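Those three Public vs. Private buckets translate directly into a lookup for a single game's expected result.  A sketch, with a hypothetical function name.  The handling of the exact boundary values (122 and 1899) and of equal enrollments is an assumption.

```python
def public_private_win_prob(team_enrollment: int, opponent_enrollment: int) -> float:
    """Expected win probability for the team in a Public vs. Private game,
    based only on the enrollment gap and the three buckets quoted above."""
    if team_enrollment == opponent_enrollment:
        return 0.5  # assumption: no edge either way
    gap = abs(team_enrollment - opponent_enrollment)
    if gap < 122:
        larger_wins = 0.54
    elif gap <= 1899:
        larger_wins = 0.40  # the smaller school wins 60% in this range
    else:
        larger_wins = 0.66
    if team_enrollment > opponent_enrollment:
        return larger_wins
    return 1 - larger_wins


print(public_private_win_prob(1500, 500))  # larger school, middle bucket
print(public_private_win_prob(500, 1500))  # smaller school, middle bucket
print(public_private_win_prob(2500, 300))  # larger school, top bucket
```

Averaging these probabilities over a team's Public vs. Private games gives an expected winning percentage for that part of the schedule, just as with the Public vs. Public numbers.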

The driving force behind this is which teams are playing these games.  For the most part, in the bookend categories where the larger school is winning, the Private school in question is one of the Class D schools like Doane Stuart or Hawthorne Valley.  Some of them have very small enrollments.  Oftentimes, these schools aren't focused on basketball, but rather academics (how dare they!), are located far away from the larger cities in the area (so they don't have a drawing card) and don't always field competitive teams.

The middle group, however, is mostly made up of the larger, bigger-city schools like CBA, Albany Academy or LaSalle.  These types of schools are able to pull talent away from the larger Public schools, which accomplishes two things: it makes them better and their opponents worse.  In this case, CBA is still expected to win a majority of games, as one would think.  The other schools seem to line up pretty well too.  I anticipate that as I obtain more years of data I can probably break this group down a bit further, but at the moment there wouldn't be enough games to make it statistically viable.

Prediction Variables


As I previously mentioned there are almost no advanced statistical metrics and even if there were these are high school athletes, not professionals.  Inconsistent play is inevitable and trying to figure out how a team will play on a given night against an unknown opponent is nearly impossible.  That's not to say all hope is gone.  I already have a model that will rank the teams by their propensity to win sectionals.  By design it's a conservative model using only standard statistical variables that can be used for any team under any circumstance to arrive at a fair representation of their standing.  But, if I want to predict winners I need to be a little more aggressive with my variables.

As anyone who has paid any attention to Section 2 basketball recently can tell you, CBA is pretty good year to year.  They have won 7 of the 9 AA sectionals since it was changed from a 4-class system to a 5-class system.  They lost in the finals in the other two, both times to Bishop Maginn.  If you missed that, it means all nine AA champions have come from the Big 10.  Of the 18 teams to play for the AA championship in Section 2, 15 (83%) have come from the Big 10.  So, some value can be taken from past sectional performance, and that is where I went for my alternative prediction variable.  For this model I'll be replacing the opponents' opponents' variable with a historical sectional winning percentage variable.

There are dangers in using this type of variable.  Most teams just haven't played enough games to make it valuable or predictive.  After all, if you lose in the first round you only play one game and it's a loss.  To get back to .500 you have to win 2 games the next year which only 4 teams do in each class each year.  Teams can also switch classes year to year.  Over the past 5 seasons Queensbury has gone between AA and A and has a .250 winning percentage in AA while having a .500 in A.  Using their overall winning percentage wouldn't be accurate in either class since they are playing different competition, but they've only played 4 games in each class, which isn't really enough to base any statistical merit on.

There is also a problem with turnover.  Teams typically turn over their entire roster every two years.  Coaching can account for some of the level of performance, but not enough in most cases to outweigh changing rosters every year.  This is why I've decided to use the performance of each team's conference in sectionals, counting a bye game as a win and a team's non-participation in sectionals as a loss.  This method isn't without flaws either, which I'll get into, but for the moment I'll focus on its positives.  In AA for example, the Big 10 has a .595 winning percentage over the past 5 years and the Suburban only .308.  Giving each team in the Big 10 the .595 winning percentage for that variable will increase their odds of winning in the regression model.  There are also significantly more games, so the numbers have greater validity.  The Big 10 has played 84 games and the Suburban 65, whereas the greatest number of games of any individual team is 20, and only 4 of the 19 teams that have played games in the AA bracket have played at least 10.
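The bookkeeping for that conference variable might look like the sketch below: every sectional win and bye counts as a win, every sectional loss counts as a loss, and a team-season with no sectional appearance counts as one loss.  The record layout, field names and sample numbers are hypothetical.

```python
def conference_sectional_pct(team_seasons: list[dict]) -> float:
    """Sectional winning percentage for a conference across its team-seasons."""
    wins = losses = 0
    for season in team_seasons:
        if not season["participated"]:
            losses += 1  # missing sectionals counts as a loss
            continue
        wins += season["wins"] + season["byes"]  # a bye counts as a win
        losses += season["losses"]
    games = wins + losses
    return wins / games if games else 0.0


# Three made-up team-seasons from one conference.
sample = [
    {"participated": True, "wins": 2, "losses": 1, "byes": 1},
    {"participated": True, "wins": 0, "losses": 1, "byes": 0},
    {"participated": False, "wins": 0, "losses": 0, "byes": 0},
]
print(f"{conference_sectional_pct(sample):.3f}")  # 0.500
```

Every team in the conference would then carry that one number into the regression in place of its own thin sectional record.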

So what are the negatives?  For starters, if you exclude CBA and Bishop Maginn the Big 10 only wins .383.  It's still better than the Suburban, but not significantly.  The other big negative is a trend I've noticed where conferences go through cycles, better at some times than at others.  When I first started watching Section 2 basketball the Patroon Conference hadn't won a championship in any class in over a decade (I think it was 1986, but don't quote me).  They've won 3 in the past 5 years.  There was one season in the late '90s when the CHVL as a league only won 2 non-league games the entire season.  Last year in sectionals they won 3 of 5.

