League leaders are a common parts of any sports statistic web site. Who scores the most goals, who gets the most assists, who hits the most homer runs. On any number of the big sports sites you can easily find that for any sport. Averages, in general, though still pretty basic, tell slightly more than counting stats. Averages will give you a better indicator (than counting) of how a player does from game to game, and while not incredibly useful in general, I think it’s a useful goal in regards to building out my application.

Counting averages, however, can be tricky, Let’s say a player participates in 3 games, he scores 10, 5, and 60, and then is injured and doesn’t play the rest of the season. That player will have an average of 25 points per game, which is a great average not doubt, but it’s only over 3 games. In the NBA there are 82 games in a season. A player who averages 25 games a season is often seen as a good player (more on that later), but if you only know the average and not that the player player played in less than 0.5% of the games, you’re not getting the whole story, and thus we lead to qualifications

Qualifying for the Leader Board

Over any season, in any sport, players will definitely miss games, due to injuries, rest, or personal issues, most players will not play every game in a season, but, a level should be established in terms of playing time (how that is defined varies from sport to sport) that would qualify a player to be considered for being one of the league leaders and any categories that you wish to determine. In baseball for instance, I believe it is counted in terms of plate appearances (that’s not the same as at bats people). I don’t know how other websites do it in basketball, but what I decided would be the qualifier is that a player averages at least 20 minutes per game (slightly less than half a game) and has appeared in half the games.

Now I’m sure there’s a good way to determine if a player has appeeared in half of the games their team played on but when you take into roster movement, which I don’t fully track, I had to come up with another way to determine half the games, and what I came up with is not perfect but I think it will work.

There are a set number of distinct dates in which games are played in the NBA, and let’s say, on average, that a team plays every 2.5 days. (82 * 2.5 is 205 / 30 is is 6.8 months, which about the length of the regular season). Again, I’m not looking to be exactly right here, I’m looking to find a way to qualify players who play often enough to be considered for my leader board calculations. So, if you take the distinct number of dates on which games are played, divide that by 2.5 and then divide it by 2 again (because remember we only need them to appear in at least half the games), you should get a rough estimate of half the number of games played by any team at any given date in the season.

In SQL Terms, it looks something like this.

Select count(distinct gamedate)/5 FROM games

Now since this is something that will be called constantly, it should be defined as a scope (or method) in Rails. Recently, I’ve been researching ActiveRecord queries a bit more and feel that for me, using methods instead of scopes works better, so that’s what I’ll do, but first, you know, you gotta write the test.

This is one of those tests in which I pre-load my saved data set. I know from querying the data directly above that the result of my method should be 6, so the test is written in the models section to pre-load the data and run the corresponding method that I’ll call qualify, and to then expect the outcome of 6. At first, as predicted, the test fails until the proper code is written.

def self.qualify
  select(:gamedate).distinct.count/5
end

Looks pretty much the same as the SQL statement above except oriented slightly differently doesn’t it? Not a huge surprise with something like this, but this method works and the test then passes, but that, as they say, is the easy part. The more complicated part is using this information in concert with information from another database table to determine qualifying players.

A note to anyone who might be thinking ahead, I do realize I haven’t taken into account multiple seasons yet, but getting one season going is taking a bit of time

So now that we’ve figured out how many games a player must have appeared in (roughly) to qualify for the leader board, it’s time to work on the second qualification which requires a player average more than 20 minutes per game. This too is an arbitrary number that I chose, but in time I can choose to change it, and with tests set up right, I’ll be able to make sure it’s changed properly if necessary.

It’s a pretty simple matter to select records based on a the average of a specific attribute in SQL. A variety of aggregate functions, including average are built into most SQL based database management systems for just such a purpose, but when you use an aggregate you must also group the results so that a given identifier only occurs once. For example:

SELECT player_id, avg(time_played) FROM statistics GROUP BY player_id HAVING avg(time_played) > 1200

1200 is the number of seconds in 20 minutes. Because of the source data, the time a player plays is actually provided in the format of a string “MM:SS”. When I go through game information for players, I convert that string into total seconds because it’s easier to work with one column than two separate columns. It’s a rather simple matter to convert the data back to the original format, so this is how I store it.

Of course, the above query, like the first one, will only tell half the story. What I really want is to get all the players who have averaged 20 minutes per game and have appeared in that number returned by the first method we built up above.

The SQL looks something like this

SELECT statistics.player_id
FROM  statistics
GROUP BY statistics.player_d
HAVING(count(statistics.*) > (SELECT count( distinct gamedate) FROM games) / 5 AND avg(statistics.time_played)  > 1200)

The first half of that having qualification was defined up above with the Game.qualify, so now it’s matter of converting the rest of the SQL query into Active Record query equivalent.

  1. SELECT statistics.player_id FROM statistics = Statistic.select(:player_id)
  2. GROUP BY statistics.player_id = group(:player_id)
  3. HAVING (count(statistics.) > (SELECT count(distinct gamedate) FROM games) / 5 = having(“count() > ?”, Game.qualify)
  4. AND avg(statistics.time_played) > 1200 = .having( “average(time_played) > ?”, 1200)

One way to write the code is like this:

Statistic.select(:player_id).group(:player_id).having(“count(*) > ?”, Game.qualify).having(“avg(time_played) > ?”, 1200)

Now, those two having clauses chained to each other seems a little cumbersome doesn’t. Well in my research on active record queries previously, I found that the question mark can appear more than once and Rails will smartly replace them in the same order they are presented. So I think this looks a little better:

Statistic.select(:player_id).group(:player_id).having(“count(*) > ? AND avg(time_played) > ?”, Game.qualify, 1200)

Written as a method within the file corresponding to the statistic model it looks like this:

def self.qualified
    select(:player_id).group(:player_id).having("count(*) > ? AND avg(time_played > ?", Game qualify, 1200)
end

Since I know my test set, I know how many players (based on player_id) should be found within the test data, so the test was written to make sure that the qualified method returned a set with the right amount (210 out of 423, that seems about right, i think, about half the players qualify?) of results to indicate that the method as written corresponded to the SQL output of which I’m much more comfortable and experienced writing.

So that is that. Using a few different methods, I created functionality to identify qualifying players whose season to date numbers could qualify for any leader board calculations I might want to make in the future, but then again, that’s the hard part. And the hard part is for another day. Hopefully you’ll still be reading.