Today is November 5, 2016. I have been adrift recently, doing a bit here and a bit there, but nothing has really been progressing: not projects, not learning, and obviously not articles. Tomorrow is the 8th game of the season for the Philadelphia Eagles, the NFL team that I root for, and it was rattling around my head to do a little box score analysis, as the season would be half over. I wanted to see whether any trends had become apparent between the first 4 games and the second 4 games. It’s possible that nothing would be visible, or maybe something would be.

Of course, there’s the easy way to do this: wait until after the 8th game is finished, find the box scores for each of the 8 games, copy and paste the information into Excel, and manipulate it to see if anything is there. But, the way I think, that is a flawed idea for a lot of reasons. Firstly, only looking at games played by the Eagles would be slanted; there’s no objectivity and no full picture, as I see it. How do these numbers compare with the rest of the league? Are there any useful numbers that distinguish the Eagles, good or bad, from the rest of the league? Thus, the brain evolves to the point where I want to be able to get box score information (at the least) for all games and work from there.

Any project like this begins with accumulating data: you have to find the data and then figure out a way to capture it. This project is slightly different from my NBA project, as different information is available for NFL games, and (in the long run) the play by play information is much more vital to get, but more complex to deal with. Thus, I wanted to start with box scores to see how it would progress, so away we go.

Step 1 - Identifying the Source of the Raw Data

If you go to the web site for any sports league you’ll see a pretty presentation of the game data (often called the box score) for all their games. While it is good to look at, and there are ways you can parse it if you want, I’ve found that the sources providing the raw data are often easier to work with, if you have the patience and time to figure them out.

Why do they require patience and time? Well, I could tell you, but instead I’ll provide you two links:

  1. This past Thursday, the Atlanta Falcons and Tampa Bay Buccaneers played a football game. The ‘user friendly’ version of the results can easily be found on NFL.com. It’s a nice layout, mostly. You can quibble about colors and size, but in the end, everything someone who just wants to look at the box score would need is easy to find. However, I want to do more than just look at these numbers; I want to manipulate them, and that leads to…

  2. This link here is the source data that is fed to your browser and processed to present the pretty box score you saw in item 1. This is what I want. It’s dirty, it’s raw, and it’s in JSON, so I can play with it.

The two links have one key thing in common, and it may seem a bit odd: 2016110300 is the unique identifier applied to this game by NFL.com, and it ties the pretty page to the raw data. This is the first piece of useful information in our research. Thursday was November 3rd, so what this identifier says to me is that this was the first game (00) played on November 3, 2016. (For those who don’t know, most counting in programming starts at 0, not 1; that’s why 00 is the first game.) To test my theory, I took the link from #2 and replaced all references to 2016110300 with 2016102700, and it returned this, which is the raw data for the previous Thursday’s game. I note this for future reference.
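To make that reading of the identifier concrete, here is a minimal sketch that splits it into a date and a game counter. The method name parse_game_id is my own invention for illustration, and the layout (eight date digits followed by a two-digit, zero-based counter) is only my theory from the two games above, not anything NFL.com documents.

```ruby
require "date"

# Hypothetical helper: split a game identifier into its apparent parts.
# Assumption: the first eight digits are the date (YYYYMMDD) and the last
# two are a zero-based counter of games on that date.
def parse_game_id(game_id)
  {
    date: Date.strptime(game_id[0, 8], "%Y%m%d"),
    game_index: game_id[8, 2].to_i
  }
end

info = parse_game_id("2016110300")
puts info[:date].strftime("%A, %B %-d, %Y")  # Thursday, November 3, 2016
puts info[:game_index]                       # 0
```

Running the same method on 2016102700 also lands on a Thursday, which fits the theory.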

Step 2 - Parsing the Raw Data

I’ve spoken about JSON and dealing with it before, but as a quick refresher: JSON stands for JavaScript Object Notation. It not only follows well-defined rules, it also gets a lot of built-in functionality in my favorite object-oriented language, Ruby, because once parsed it appears to Ruby as a Hash.

In programming, there are arrays. Arrays are collections of data whose members are identified purely by their position in the group. For example, x = ["a", "b", "c", "d"] is a simple array (x) of the first four letters of the alphabet. You access members of an array purely by their location (referred to as an index), but remember, counting in these languages starts at zero. So, say you want the third member of the array. That is index 2, so x[2] would return c.
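Here is the same array as something you can paste into Ruby and try:

```ruby
x = ["a", "b", "c", "d"]

puts x[0]      # a  (the first member lives at index 0)
puts x[2]      # c  (the third member lives at index 2)
puts x.length  # 4
```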

Hashes are similar to arrays, but they have a bit more structure and less dependence on order. The hash that would represent me would perhaps be john = {name: "John Magee", age: 44, height: 72, favorite_color: "green"}. Accessing parts of a hash has the same basic setup as the array above, but instead of a number, you pass in what is called the key to retrieve the value. For example, john[:height] would return the value 72 (I’m 6’ tall, but it’s easier to work in inches and then convert, for reasons not relevant right now, but trust me). Keys written in the name: style are Ruby symbols, which is why the lookup is john[:height] rather than john["height"].
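A short sketch of that hash in action. One detail worth seeing for yourself: keys written in the name: style are symbols, so a string lookup comes back empty. That matters later, because JSON parsed with Ruby’s default settings uses string keys instead.

```ruby
john = { name: "John Magee", age: 44, height: 72, favorite_color: "green" }

puts john[:height]           # 72
puts john["height"].inspect  # nil, because "height" (a string) is not the key :height

# Converting the inches to feet and inches when it is time to display them.
feet, inches = john[:height].divmod(12)
puts "#{feet}' #{inches}\""  # 6' 0"
```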

So, when I process that really large piece of information from item 2 above through a JSON parser (included in the Ruby JSON library, so I don’t have to do it manually), I get a hash, but as you probably expect, it’s a very large hash. Ruby has built-in methods that give you information about a hash and help you work with it. One of those methods is keys. For instance, if I typed john.keys, based on the hash above, I would be given the following:

  • name
  • age
  • height
  • favorite_color

(It wouldn’t be laid out this way with bullet points; this is just for example.) So, the first step in working with an unknown hash is to identify the keys.
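For the curious, this is roughly what the keys method hands back, plus a loop that walks every key and value, which is the habit I lean on when poking at a hash I have never seen before:

```ruby
john = { name: "John Magee", age: 44, height: 72, favorite_color: "green" }

p john.keys  # [:name, :age, :height, :favorite_color]

# Walk every key/value pair to see what the hash holds.
john.each do |key, value|
  puts "#{key}: #{value}"
end
```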

To examine the downloaded hash and its values, I load it into a handy dandy interactive Ruby shell called irb, which all us Ruby users have access to, and proceed to analyze the content of my downloaded hash.

The hash only has two keys: one is the unique identifier above, 2016110300, and the other is called nextupdate. I look at the contents of nextupdate and it is purely a number value, so it’s not relevant to what I’m doing at this time. I then print out the value of the first key, and unsurprisingly it’s a complex setup. Using Ruby’s built-in class method, I discover that the value of the key is also a hash. (That’s right, you can have a hash within a hash.) This isn’t that surprising to me; it’s how the NBA statistics worked as well, and it is the power of JSON. There will be multiple levels of hashes linking the appropriate information together until you get down to the game specific data, so it is just a matter of drilling down through Hashes, Arrays, and Keys until you not only find the raw game data you want but also know exactly how to access it.
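The exploration above looks something like this in irb. This is only a sketch: the raw string is a made-up miniature with the same two top-level keys, not the real download, and the 0 for nextupdate is a placeholder.

```ruby
require "json"

# Stand-in for the downloaded text. Only the two top-level key names come
# from the real data; the contents are invented placeholders.
raw = '{"2016110300": {"home": {}, "away": {}}, "nextupdate": 0}'

parent_hash = JSON.parse(raw)

p parent_hash.keys  # ["2016110300", "nextupdate"]

# Ask each value what it is before deciding where to drill next.
parent_hash.each do |key, value|
  puts "#{key} is a #{value.class}"
end

# One level down: the game hash has keys of its own.
game = parent_hash["2016110300"]
p game.keys         # ["home", "away"]

# Hash#dig walks several levels at once and returns nil, rather than
# raising an error, if a level is missing.
p parent_hash.dig("2016110300", "home")     # {}
p parent_hash.dig("2016102700", "home")     # nil
```

Notice that the parsed keys are strings, not symbols, so every lookup from here on uses quotes.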

The long term goal, similar to the NBA one, is to automate the downloading and processing of the box scores, but that can only be done after the identification of how to navigate the downloaded raw data.

The ‘second level’ has a variety of keys, some of which might come in handy, some of which might not, but for posterity, I’m going to list them here:

  • home
  • away
  • drives
  • scrsummary
  • weather
  • media
  • yl
  • qtr
  • note
  • down
  • togo
  • redzone
  • clock
  • posteam
  • stadium

A quick scan through the contents of most of these keys reveals that the useful information is only up at the top. Most of them are actually empty after the game is over, as this is also used as a ‘live’ box score, and once the game is over, much of the information above is no longer relevant to display or keep. That’s not surprising to me (though game attendance would be nice; the NBA gives it, this one doesn’t, but it’s ok, it’s not an important piece of information), as I expected most of what I would be looking for would be in the home and away keys. The drives hash was surprising. It contains the play by play of the game, which means if I ever want to try and work through that, I can, using the data already downloaded. (The NBA puts play by play somewhere else.)
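Rather than eyeballing every key, the scan can be roughed out in code. This is a sketch under heavy assumptions: the game hash below borrows a few key names from the list above, but every value is an invented placeholder, so which keys show up as blank here says nothing about the real data. The helper blank_value? is my own name, not something built into Ruby.

```ruby
# Made-up miniature of the second level. Only the key names come from the
# list above; every value is an invented placeholder.
game = {
  "home"    => { "stats" => {} },
  "away"    => { "stats" => {} },
  "drives"  => { "1" => {} },
  "weather" => nil,
  "media"   => nil,
  "yl"      => "",
  "note"    => nil,
  "stadium" => nil
}

# Hypothetical helper: treat nil, "", [] and {} as "nothing here".
def blank_value?(value)
  value.nil? || (value.respond_to?(:empty?) && value.empty?)
end

# Split the key/value pairs into those worth a closer look and the rest.
populated, blank = game.partition { |_key, value| !blank_value?(value) }

puts "populated: #{populated.map(&:first).join(', ')}"
puts "blank:     #{blank.map(&:first).join(', ')}"
```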

After more drilling down I was able to find where the information I want is stored, which is a great first step. Using our example game, if you wanted access to the raw data for the rushing statistics for the home team, you would type the following, starting with the parent hash: `parent_hash["2016110300"]["home"]["stats"]["rushing"]`. However, even that isn’t fully ‘processed’, as most teams have multiple rushers, so you are still given a hash like this:

{"00-0032741"=>{"name"=>"P.Barber", "att"=>11, "yds"=>31, "tds"=>0, "lng"=>8, "lngtd"=>0, "twopta"=>0, "twoptm"=>0}, "00-0026855"=>{"name"=>"A.Smith", "att"=>5, "yds"=>25, "tds"=>0, "lng"=>8, "lngtd"=>0, "twopta"=>0, "twoptm"=>0}, "00-0031503"=>{"name"=>"J.Winston", "att"=>2, "yds"=>14, "tds"=>0, "lng"=>14, "lngtd"=>0, "twopta"=>1, "twoptm"=>0}, "00-0030391"=>{"name"=>"M.James", "att"=>1, "yds"=>3, "tds"=>0, "lng"=>3, "lngtd"=>0, "twopta"=>0, "twoptm"=>0}}

However, even though we are still dealing with a hash, we have found the source of what we were looking for: the individual statistics for players, which can then be abstracted and used for future analysis.
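As a taste of that abstraction, here is a minimal sketch that flattens the rushing hash into one row per player and adds up the team totals. The numbers are copied from the hash above, trimmed to four fields per player to keep it short; the row layout is just my choice for illustration.

```ruby
# Copy of the rushing hash shown above, trimmed to four fields per player.
rushing = {
  "00-0032741" => { "name" => "P.Barber",  "att" => 11, "yds" => 31, "tds" => 0 },
  "00-0026855" => { "name" => "A.Smith",   "att" => 5,  "yds" => 25, "tds" => 0 },
  "00-0031503" => { "name" => "J.Winston", "att" => 2,  "yds" => 14, "tds" => 0 },
  "00-0030391" => { "name" => "M.James",   "att" => 1,  "yds" => 3,  "tds" => 0 }
}

# Turn the hash-of-hashes into a flat list of rows, one per rusher,
# keeping the player identifier that was the outer key.
rows = rushing.map do |player_id, line|
  { id: player_id, name: line["name"], att: line["att"], yds: line["yds"], tds: line["tds"] }
end

rows.each do |row|
  puts format("%-10s %2d att %3d yds %d td", row[:name], row[:att], row[:yds], row[:tds])
end

total_att = rows.sum { |row| row[:att] }
total_yds = rows.sum { |row| row[:yds] }
puts "Team: #{total_att} att, #{total_yds} yds"  # Team: 19 att, 73 yds
```

Rows like these are much easier to dump into a spreadsheet or a database than the nested hash is.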

I think that is it for now; I’m going to try working in small chunks. I hope you enjoyed reading and will come back soon. Feel free to comment, usefully please. Though it’s a primitive setup, I will delete all spammy comments.