Thursday, November 29, 2012

Parity Between Boys' and Girls' Basketball

I haven't written much about Girls' basketball up to this point, mostly because I don't have as much information on it.  I've been following Boys' basketball a lot longer and have 3 full seasons of data as opposed to only one for the Girls.  But when I decided to do this blog I wanted to cover both.  Math is math, after all; it doesn't care who is playing.  I was also curious whether my theories held true.

What I have found so far is that they do hold up, but not in quite the same way.  I'm not going to quote the numbers yet because there really isn't enough data to make them significant, but the trend lines do look similar.  There are differences, however.  The minimums are much higher.  Whereas in the Boys' bottom group the enrollment variance is between 0 and 31, for the Girls it jumps all the way up to 200.  That will probably change some as I get more years, but the largest for any individual season I have on the Boys' side so far has been 81.

The winning percentages on the larger enrollment variances aren't as great as the Boys' either.  That might just be a statistical fluke from last year, or it might be that not as many large schools play small schools on the Girls' side (or maybe they play more).  My hunch is that it's neither of those but rather that choosing a Girls' basketball team is a bit more hit or miss than on the Boys' side.  I can't prove that with the limited data, but if you look at the number of teams that had a winning percentage greater than .800 last year, there were 15 on the Girls' side and only 12 on the Boys'.  Correspondingly, there were 13 Girls' teams with winning percentages less than .200 as opposed to 12 for the Boys.  That means 28 Girls' teams were in the extreme top or bottom in winning percentage, compared with 24 on the Boys' side.  Not a huge gap, but if that trend continues year to year it could be a significant one.

Monday, November 26, 2012

Does the Leading Scorer Always Win?

One thing I've wondered about is how important it is to have the game's leading scorer.  So, we're going to find out together.  Each week I'll be posting the high scorer for each game and whether or not their team won and the running W-L record.  I'll post separate pages for each with all the games as well.

Boys Week One 11/20/12-11/26/12
Garrett Pitcher, Berne-Knox, Neutral floor Tamarac, 28 pts, W     1-0
Andrew Spath, Rensselaer, vs. Heatly, 19 pts, W     2-0
Garrett Pitcher, Berne-Knox, @ Rensselaer, 22 pts, W     3-0
Lucas Van Nostrand, Mayfield and Logan Samarija, Maple Hill, Neutral floor, 18 pts each, T     3-0-1
Ethan Ross-Hixson, Hoosic Valley, vs. Duanesburg, 18 pts, W     4-0-1
Pat LaPorte, Salem, @ Greenwich, 20 pts, L     4-1-1
Jordan Van Nostrand, Mayfield, Neutral floor Duanesburg, 18 pts, W     5-1-1
John Rooney, Hoosic Valley, vs. Maple Hill, 23 pts, W     6-1-1

Girls Week One 11/20/12-11/26/12
Hailee Metzold, Schalmont, Neutral floor Mekeel Academy, 22 pts, W     1-0
Paris Jarrell, Bishop Gibbons, vs. Cobleskill, 13 pts, L     1-1
Ailayia Demand, Watervliet, vs. Albany Leadership Charter, 19 pts, W     2-1
Jen Groat, Ballston Spa, Neutral floor Cohoes, 20 pts, W     3-1
Macie Holmes, Mekeel Academy, @ Bishop Gibbons, 22 pts, W     4-1
Alexandra Cardinal, Schalmont, Neutral floor Cobleskill, 19 pts, W     5-1
Mikayla DeGuire, Watervliet, vs. Ballston Spa, 16 pts, W     6-1
Hannah Saroff, Lake George, vs. Greenwich, 16 pts, W     7-1
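
The running record in the lists above is simple bookkeeping.  Here is a minimal sketch in Python, assuming each game is reduced to a single letter (W, L or T) for the leading scorer's team.  The function name is made up for illustration.

```python
def running_records(results):
    """Return the running W-L(-T) record after each game.

    results: iterable of "W", "L" or "T" for the leading scorer's team.
    The tie column only appears once a tie has occurred, as in the lists above.
    """
    wins = losses = ties = 0
    records = []
    for result in results:
        if result == "W":
            wins += 1
        elif result == "L":
            losses += 1
        elif result == "T":
            ties += 1
        else:
            raise ValueError(f"unexpected result: {result!r}")
        record = f"{wins}-{losses}"
        if ties:
            record += f"-{ties}"
        records.append(record)
    return records


boys_week_one = ["W", "W", "W", "T", "W", "L", "W", "W"]
print(running_records(boys_week_one)[-1])  # 6-1-1
```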

Annual Scoring Totals
I'm also going to try to post when a player reaches a milestone such as 30 point games or when someone reaches 100 points on the season (or 200, or 300, etc.).  Since no one has played more than two games no one has quite gotten there yet, but Garrett Pitcher of Berne-Knox is at 50 already after playing just two games.

Thursday, November 22, 2012

Early Signs of a Tough Season

As with anything, it's difficult to read too much into one data point, in this case a game, especially the first game of the season.  There are, however, some early signs that Tamarac will probably have a tough year.  For starters, in their first game of the season they held a 118 enrollment edge on Berne-Knox, which would historically result in Tamarac winning 56.7% of the time.  Losing this type of game isn't in itself cause for alarm, but losing by 36 points is.

Over the last 3 seasons only three teams have finished the season with an average margin of loss greater than 30. Those three teams had a combined record of 1 win and 44 losses against Section 2 teams not including themselves. My formula would give them an expected win percentage of .000 if that trend continued (which is probably unlikely), but if you use the 1-44 mark, that would be a .022.  Either way that averages out to zero wins for an 18 game season.
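
The arithmetic behind the .022 figure and the zero-win projection can be checked in a few lines.  This is only a sketch of the rounding, not the formula itself, and the function name is made up for illustration.

```python
def expected_wins(wins, losses, season_games=18):
    """Return (winning percentage, expected wins over a season of season_games)."""
    win_pct = wins / (wins + losses)
    return win_pct, win_pct * season_games


win_pct, wins_expected = expected_wins(1, 44)
print(f"{win_pct:.3f}")       # 0.022
print(round(wins_expected))   # 0
```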

I'm still waiting to see how they fared in the Heatly game which will give us a bit more information and will more than likely change everything I just wrote about.  The enrollment edge is 272 for Tamarac in that one and gives them a 59.4% likelihood.  Regardless, the two wins by Berne-Knox and Rensselaer were impressive and the opposite of the above can be said for them.  It will be interesting to see how that game turned out and what clues it can give us for the rest of their seasons.

One game doesn't make a season, and I can't even say I know whether or not everyone on these teams played, but it can give you a little insight as to how the season will progress. 

Saturday, November 17, 2012

The Prediction Variables

Predicting is a risky business.  Lots of people are wrong, and often aren't even remotely close.  Some people are dead on.  Very few are good enough to make a living out of it by writing or other means.  The difficulty in using regression to predict things in sports is that you typically have little or no information within a single season.  In politics, you can have hundreds of polls showing real people's tendencies.  In high school basketball you have 18 games and 9 conferences (or leagues).  The chance that everyone in a sectional bracket will have played each other is zero, and you'll be lucky if you even get one first round game where the two teams have met in the regular season.

As I previously mentioned, there are almost no advanced statistical metrics, and even if there were, these are high school athletes, not professionals.  Inconsistent play is inevitable and trying to figure out how a team will play on a given night against an unknown opponent is nearly impossible.  That's not to say all hope is gone.  I already have a model that will rank the teams by their propensity to win sectionals.  By design it's a conservative model using only standard statistical variables that can be used for any team under any circumstance to arrive at a fair representation of their standing.  But if I want to predict winners I need to be a little more aggressive with my variables.

As anyone who has paid any attention to Section 2 basketball recently can tell you, CBA is pretty good year to year.  They have won 7 of the 9 AA sectionals since it was changed from a 4 class system to a 5 class system.  They lost in the finals in the other two, both times to Bishop Maginn.  If you missed that, it means all nine AA champions have come from the Big 10.  Of the 18 teams to play for the AA championship in Section 2, 15 (83%) have come from the Big 10.  So, some value can be taken from past sectional performance, and that is where I went for my alternative prediction variable.  For this model I'll be replacing the opponent's opponent's variable with a historical sectional winning percentage variable.

There are dangers in using this type of variable.  Most teams just haven't played enough games to make it valuable or predictive.  After all, if you lose in the first round you only play one game and it's a loss.  To get back to .500 you have to win 2 games the next year which only 4 teams do in each class each year.  Teams can also switch classes year to year.  Over the past 5 seasons Queensbury has gone between AA and A and has a .250 winning percentage in AA while having a .500 in A.  Using their overall winning percentage wouldn't be accurate in either class since they are playing different competition, but they've only played 4 games in each class, which isn't really enough to base any statistical merit on.

There is also a problem with turnover.  Teams typically turn over their entire roster every two years.  Coaching can account for some of the level of performance, but not enough in most cases to outweigh changing rosters every year.  This is why I've decided to use the performance of each team's conference in sectionals, counting a bye game as a win and a team's non-participation in sectionals as a loss.  This method isn't without flaws either, which I'll get into, but for the moment I'll focus on its positives.  In AA, for example, the Big 10 has won .595 over the past 5 years and the Suburban has only won .308.  Giving each team in the Big 10 the .595 winning percentage for that variable will increase their odds of winning in the regression model.  There are also significantly more games, so the numbers have greater validity.  The Big 10 has played 84 games and the Suburban 65, whereas the greatest number of games of any individual team is 20 and only 4 of the 19 teams that have played games in the AA bracket have played at least 10.
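
To make the counting rule concrete, here is a rough sketch of how a conference's sectional winning percentage could be tallied, with a bye counted as a win and a missed sectional counted as a loss.  The record layout, the names and the sample numbers are all hypothetical and are not taken from the actual database.

```python
from dataclasses import dataclass


@dataclass
class TeamSeason:
    wins: int               # sectional games won
    losses: int             # sectional games lost
    byes: int = 0           # bye rounds, counted as wins
    qualified: bool = True  # False if the team missed sectionals (one loss)


def conference_sectional_pct(seasons):
    """Pooled sectional winning percentage for every team-season in a conference."""
    total_wins = total_losses = 0
    for season in seasons:
        if not season.qualified:
            total_losses += 1  # non-participation counts as a single loss
            continue
        total_wins += season.wins + season.byes
        total_losses += season.losses
    games = total_wins + total_losses
    return total_wins / games if games else 0.0


# Hypothetical conference: a champion with a bye, two early exits, one non-qualifier.
sample = [
    TeamSeason(wins=3, losses=0, byes=1),
    TeamSeason(wins=1, losses=1),
    TeamSeason(wins=0, losses=1),
    TeamSeason(wins=0, losses=0, qualified=False),
]
print(f"{conference_sectional_pct(sample):.3f}")  # 0.625
```

Every team in the conference would then carry that pooled percentage as its value for the variable.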

So what are the negatives?  For starters if you exclude CBA and Bishop Maginn the Big 10 only wins .383.  It's still better than the Suburban, but not significantly.  The other big negative is a trend that I've noticed where some conferences go into cycles where they are better than at other times.  When I first started watching Section 2 basketball the Patroon Conference hadn't won a championship in any class in over a decade (I think it was 1986, but don't quote me).  They've won 3 in the past 5 years.  There was one season in the late 90's when the CHVL as a league only won 2 non-league games the entire season.  Last year in sectionals they won 3 of 5.

If I'm sounding a bit pessimistic it's because I am.  I ran a model similar to this one about ten years ago and it performed pretty well, and did better than the committee, but I have never run this one specifically.  I'm pessimistic by nature, so don't let that get in the way.  Anytime you put your name on a prediction you're taking a risk and I'll be doing the same come mid-February.  My fingers are crossed and I'm knocking on wood.  Let the games begin.

Monday, November 12, 2012

Parochial, Charter Schools and out of Section Games

I have previously mentioned that the variables in my Sectional Forecasting model put things in terms of a winning percentage, whether it be actual or expected.  I have also mentioned how two of the variables rely on a comparison of enrollments between the team in question and its opponents or its opponents' opponents.  They are all driven by a database of games played.  What I have failed to mention is that there is more than one database.  So far, the stats I have referenced have been from the Public vs. Public database because it is much larger and should be a little more stable.  There are also databases for Public vs. Private and Private vs. Private.

Before I go any further with the other databases, there is one other thing I have failed to mention.  The model does not take into account any games played with an opponent that resides outside of Section 2.  The reason for this is pretty straightforward: I have no information on these schools and I really just don't have the time to try and find it.  Maintaining all these databases is pretty labor intensive as it is and we only have roughly 90 schools in Section 2.  To do it for the whole state and in some cases Vermont and New Jersey would just be way too time consuming.  So, just to be clear moving forward, any records I mention or stats I give are only for games against Section 2 schools unless otherwise stated.

One other bit to avoid confusion: when I refer to Private schools I am referring to both Parochial and Charter schools.  Both are allowed to accept applicants and are either unrestricted or only partially restricted in which school districts those applicants come from.  It's also cleaner to use one name instead of two.  In either case, comparing their enrollments as if they were Public schools doesn't work, but when you compare them to each other you get more of an apples to apples comparison.  And when you compare them as a group to the Public schools you can get a bit of a read on what the relationship between their enrollments is.

First off, let's compare Private schools vs. Private schools.  If the theory holds true that two schools who live life by the same rules get better as their enrollment increases, we should see an effect similar to the one in the Public vs. Public database.  In the largest Public vs. Public category, with a greater than 900 enrollment variance, the larger school wins 85% of the time.  In the largest category for Private vs. Private schools, a variance of greater than 275, the larger school wins 87.5% of the time.  That is certainly comparable when you take into account the significantly smaller enrollments of the Private schools.  And the trend is comparable as well, going from a .441 winning percentage when the variance is less than 81 to .632 between 81 and 274 and up to .875 over 275.

Now that we know the theory holds true, at least in Section 2 over the past three years, we can compare the Public schools to the Private schools.  When I first started trying to figure this out, I thought that if I could come up with a multiple for Private schools, I could then compare Public and Private on the same level.  Unfortunately it failed miserably every time.  There just isn't enough common ground between the Private schools to use one multiple.  I also tried using several different ones based on factors like the size of the city the school is located in, but it became too muddled and I ended up with schools needing to change class to make it work.

But, if you separate out the Private schools and compare them as a group to the Public schools you can use that information in the same way.  You can take each game they play and get an expected winning percentage just as you do with the Public schools.  So, how exactly do they compare?  Well, not as cleanly as you might expect, but when you look more closely it does make sense.  When a Public school plays a Private school and the variance in enrollment is less than 122, the larger school wins 54% of the games.  When the variance is greater than 1899, they win 66%.  Now here is where it gets interesting, between 122 and 1899, the smaller team wins 60%. 
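
As a sketch of how those three Public vs. Private buckets could turn a single game into an expected winning percentage, here is a small lookup in Python.  The function names are hypothetical, and treating two schools with identical enrollments as a coin flip is an assumption for illustration, not something taken from the database.

```python
def larger_school_win_pct(enrollment_gap):
    """Historical win rate of the LARGER school in a Public vs. Private game."""
    gap = abs(enrollment_gap)
    if gap < 122:
        return 0.54
    if gap <= 1899:
        return 0.40  # the smaller school wins 60% in the middle group
    return 0.66


def expected_win_pct(team_enrollment, opponent_enrollment):
    """Expected winning percentage for `team` in one Public vs. Private game."""
    if team_enrollment == opponent_enrollment:
        return 0.5  # assumption: no larger school, so call it even
    p_larger = larger_school_win_pct(team_enrollment - opponent_enrollment)
    if team_enrollment > opponent_enrollment:
        return p_larger
    return 1.0 - p_larger


print(f"{expected_win_pct(400, 1500):.2f}")  # 0.60, smaller school in the middle group
print(f"{expected_win_pct(2200, 200):.2f}")  # 0.66, larger school in the top group
```

Averaging these per-game values over a schedule gives the expected winning percentage the model compares against a team's actual record.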

The driving force behind this is which teams are playing these games.  For the most part, in the bookend categories where the larger school is winning, the Private school in question is one of the class D schools like Doane Stuart or Hawthorne Valley.  Some of them have very small enrollments.  Oftentimes these schools are focused on academics rather than basketball (how dare they!), are located far away from the larger cities in the area (so they don't have a drawing card) and don't always field competitive teams.

The middle group, however, is mostly made up of the larger, big-city schools like CBA, Albany Academy or LaSalle.  These types of schools are able to pull talent away from the larger Public schools, which accomplishes two things: it makes them better and their opponents worse.  In this case, CBA is still expected to win a majority of games, as one would think.  The other schools seem to line up pretty well too.  I anticipate that as I obtain more years of data I can probably break this group down a bit further, but at the moment there wouldn't be enough games to make it statistically viable.

Now you know how the model works and how the pieces fit together, so you can sit back, watch the scores and let me handle the math.  My next post will cover the other model I have designed, which is more of a predictive model that includes a different variable altogether.  Beyond that I hope to get a page up with all the teams and their enrollments for this season and their 5 year averages.  Once I get full schedules from the schools I hope to put up the hardest and easiest schedules in Section 2.  And, of course, the games are almost here.

Saturday, November 10, 2012

The Variables of Sectional Forecasting, part 2

I've already established how winning percentage is a valuable tool in determining sectional winners and why it is the first variable in my equation.  I've also discussed the second variable, the variance between a team's actual winning percentage and what they would have been expected to win based on their enrollment and that of their opponents.  When I first looked into working on this project I looked to college basketball's RPI.  The RPI is a straightforward calculation using 25% of your winning percentage, 50% of your opponents' winning percentage and 25% of your opponents' opponents' winning percentage.  While it is fairly easy to calculate, most people agree it's also not overly effective.
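
For reference, the basic RPI weighting described above fits in one line of Python.  The sample inputs are made up for illustration.

```python
def rpi(win_pct, opp_win_pct, opp_opp_win_pct):
    """Basic RPI: 25% own, 50% opponents', 25% opponents' opponents' winning percentage."""
    return 0.25 * win_pct + 0.50 * opp_win_pct + 0.25 * opp_opp_win_pct


# Hypothetical team: .750 record, opponents at .500, opponents' opponents at .520.
print(f"{rpi(0.750, 0.500, 0.520):.4f}")  # 0.5675
```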

My original model was designed using the same three variables, only I used regression to figure out what those percentages should be instead of just assuming I knew the answer.  What I found was that even though it does a pretty good job of predicting winners, the last variable didn't add very much value to the formula.  Your opponents' winning percentage, while not adding as much value as your own, did offer some, so I kept it but modified it slightly as I did with my second variable.  The theory is the same: what is the winning percentage of your opponents compared to what one would statistically expect given enrollment factors?

If you have looked at schedules at all over the past 10 years, you will notice certain teams play different types of teams in their non-league schedules.  Schuylerville has for a while now played teams with much larger enrollments and has even played Catholic Central a few times.  Other teams in the Wasaren League take a different tactic and play teams that are closer in proximity but smaller in size.  For example, Cambridge may play Salem instead of Schoharie.  When Schuylerville chooses to play larger schools it does give them a bonus in the model, but the third variable is meant to offer further perspective.  Let's say Stillwater plays Hudson Falls.  Hudson Falls has about a 200-student enrollment advantage, but Stillwater wins the game.  Our second variable gives them extra points for beating a team they probably shouldn't based on enrollment.  But when you look further you realize Hudson Falls only won 1 game all year and significantly under-performed their schedule.  The third variable acts as an adjustment to the second, giving you credit when you beat a team you shouldn't and blame when you lose to a team you should beat.

The final variable is based on points.  One difficulty in doing a model like this is that there isn't a ton of data to use.  Not every school calls in all their games and scores may be different depending on the source you're reading or hearing it from.  There is little in the way of advanced metrics like they have in the NBA or Major League Baseball.  Points, however, should have some value in determining the level of play of a team.  Continuing with the theme of having all the variables in terms of winning percentage, I had to develop a method to turn Point Margin into a winning percentage.

If you assume that a .500 team averages a zero Point Margin per Game (PMPG) you can extrapolate that data out in both directions until you reach the max and min of Point Margin per Game.  To determine if that assumption is accurate I figured out the PMPG of all the teams in each year over the past three.  It turns out it's not zero, but it's really close, less than one point away at -0.6.  So now, at any time, I can tell you how many games you should have won by figuring out your PMPG and comparing that to historical standards.
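
The post doesn't spell out the exact mapping from PMPG to winning percentage, so what follows is only one plausible sketch: straight-line interpolation from the historical minimum (.000) through the break-even point (.500) to the historical maximum (1.000).  The straight-line shape, the function name and the plus or minus 30 extremes in the example are all assumptions for illustration.

```python
def pmpg_to_win_pct(pmpg, min_pmpg, max_pmpg, break_even=0.0):
    """Map Point Margin per Game to a winning percentage by linear interpolation.

    min_pmpg maps to .000, break_even to .500 and max_pmpg to 1.000.
    Values outside the historical range are clamped to the nearest extreme.
    """
    if not min_pmpg < break_even < max_pmpg:
        raise ValueError("need min_pmpg < break_even < max_pmpg")
    pmpg = max(min_pmpg, min(max_pmpg, pmpg))
    if pmpg >= break_even:
        return 0.5 + 0.5 * (pmpg - break_even) / (max_pmpg - break_even)
    return 0.5 - 0.5 * (break_even - pmpg) / (break_even - min_pmpg)


# Hypothetical historical extremes of -30 and +30 points per game.
print(pmpg_to_win_pct(15, -30, 30))   # 0.75
print(pmpg_to_win_pct(-30, -30, 30))  # 0.0
```

Passing break_even=-0.6 would anchor the .500 mark at the observed value instead of at zero.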

I would like to stress again that this model is only for determining sectional seeding and how a team compares to another based on the four variables in the model and how those variables weigh in on how well a team does in sectionals.  As the season goes on and more game information is available, the model will become more accurate, which is why I'm waiting until there are at least 5 games.  I've only ever run one of these models at season's end so I'm really not sure what it's going to look like as we go.  I would also like to stress that I only have 3 full seasons of information, essentially only 3 data sets (15 if you count each class as a set).  That is really not very many and with each year, the model should become more accurate and reliable.

So far I have mainly focused on Boys' basketball.  I will be working on the Girls' side as well, but since I only have one full season for them, it's still a work in progress.  Up next I'll handle the difficulty of dealing with Parochial and Charter schools and how to use their enrollment in the model.

Friday, November 2, 2012

The Variables of Sectional Forecasting, part 1

"Let's start with something simple, like one and one ain't three."  Jimmy Buffett had it right when he wrote those lyrics.  If you're going to try and figure out what variables are most important in figuring out anything you should probably start with something simple.  Winning percentage is the quickest and easiest way to figure out who is good and who isn't.  The Section 2 committee follows this number pretty closely in their seeding order.  But, just because it's quick and easy doesn't mean it's always right and just because it's not always right doesn't mean it's not valuable in predicting sectional winners.  It is also why it is my first variable.

Over the past three seasons, 8 of 15 teams with the best winning percentage in their class won sectionals.  It's by no means perfect, but if I can tell you that 53% of teams win based on one variable I would have a pretty good starting point for my statistical model.  What I have tried to do with this model is to use winning percentage as my basis and use other variables to help explain why 47% of teams with the best record don't win sectionals. 

A lot of the theory behind this model has been based on observations over the years watching high school basketball.  Some of it comes from intuition and some from trial and error.  One thing I had always assumed was that larger schools were incrementally better than smaller schools.  If you have 1,000 students to choose from to form a basketball team, you will probably have better odds of finding more talent than if you have 100 students.  The NYSPHSAA apparently feels the same way as they have split their schools into five classes based on enrollment.  Large schools compete against large schools, small against small.  The question I always had was whether or not it really did matter and if so how much and how can I incorporate that into my formula. 

Now I can answer those questions and I can incorporate it as my second variable.  Over the past three seasons the team with the greater enrollment wins 58% of the games.  Over an 18 game schedule that only amounts to 3 games over .500, but over an 82 game season like the NBA plays, they would be 13 games over .500 and certainly in the playoffs.  But even that doesn't tell the whole story.  Teams with a 200 student advantage win 66% of the time.  That is 6 games over .500 for a high school schedule or 27 games over .500 in the NBA and that will have you in the hunt for a division title.  This can have a tremendous effect on a team in Section 2.  Broadalbin-Perth plays 8 conference games against teams with more than a 200 enrollment advantage.  That means on average they are starting the season 3-5.
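
"Games over .500" here is just expected wins minus expected losses at a given winning percentage.  A quick sketch of that arithmetic, using the rounded percentages quoted above (the function name is made up for illustration):

```python
def games_over_500(win_pct, games):
    """Expected wins minus expected losses over a schedule of `games`."""
    wins = win_pct * games
    losses = games - wins
    return wins - losses


print(round(games_over_500(0.58, 18)))  # 3
print(round(games_over_500(0.58, 82)))  # 13
print(round(games_over_500(0.66, 18)))  # 6

# Broadalbin-Perth as the smaller school in 8 games: about a 34% chance in each.
print(round((1 - 0.66) * 8))            # 3, i.e. roughly 3-5
```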

This is the basis for the second variable in the model.  If Broadalbin-Perth wins 4 of those 8 games they have performed better than the average team under those circumstances and have therefore increased their likelihood of winning sectionals.  I will take every team and calculate every schedule to determine what their expected winning percentage would be and compare that to how they actually performed.  That variance is the second variable.
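
Using the Broadalbin-Perth example, the second variable can be sketched as actual winning percentage minus expected winning percentage.  The flat 34% per game below is an assumption derived from the 66% figure.  The real calculation would use a separate expected percentage for each game on the schedule, and the function name is hypothetical.

```python
def enrollment_variance(game_win_probs, actual_wins):
    """Actual winning percentage minus the enrollment-based expected percentage.

    game_win_probs: expected chance of winning each game on the schedule.
    actual_wins: number of those games the team actually won.
    """
    games = len(game_win_probs)
    expected_pct = sum(game_win_probs) / games
    actual_pct = actual_wins / games
    return actual_pct - expected_pct


# 8 games as the smaller school by 200+, each an assumed 34% chance, and 4 wins.
variance = enrollment_variance([0.34] * 8, actual_wins=4)
print(f"{variance:+.3f}")  # +0.160
```

A positive number means the team outperformed what its enrollment would predict, and a negative number means it fell short.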

In the following posts leading up to the start of the regular season, I'll be discussing the other two variables: the variance of your opponents' winning percentage compared to their expected winning percentage, and point differential.  I'm also hoping to show how private schools fit into the equation.  They create their own issues and need to be handled differently.  Finally, I'll be discussing my prediction model and how it differs from the forecasting model.  There are three weeks until the Thanksgiving tournaments and hopefully I'll be able to get these three posts out before then.