If you read part 1 of this still-nascent NFL project, you read how I worked to find the specific data from one NFL game to isolate the box score information. Continuing to delve into the hashes, I was able to dive deep enough into the parent information to find the individual player information needed to re-create a box score and perhaps do a bit more analysis from a basic point of view. While that is a great first step, that’s just one game. One game that I found manually. To do any sort of analysis you need a lot of games, and to save time and aggravation, you need to be able to get those games more easily than by hunting down the unique id of every game of every season.
Fortunately, since there are patterns and consistency in how the NFL presents its game results, and there are existing tools to take advantage of that, the problem of finding a lot of games and getting them automatically is far from impossible, though again it is time consuming.
So let’s get started.
Step 1 - Finding the information you need
This link is the page on the NFL website that lists the final score for all 16 games played in week one of the 2016 regular season. If you peruse this page just a bit you’ll see that each individual game has a button that says Game Center. This button will take you to the place where you can find all the information specific to that game. As we learned in part 1, each game has a unique identifier that can be used to access the information in other ways. So, this page does have the information I need to get the unique identifiers for every game played that week. Additionally, if you look at the link itself, you can see a pattern that would allow you to easily determine the right page for, say, week 4 of the 2015 season if you decided you wanted to download that data. That’s for later, though, and pretty simple.
While all the information is available on the page when you look at it, accessing it is not as simple. This page is the source code of the pretty page you were looking at. In short, source code is the programming, as it were, that tells your browser what to display and how to format it. It’s kind of messy and full of extraneous information, but this is where the information I need lives, and I needed to figure out how to extract it. As usual, Ruby has a nice little freely available tool to do this, so I had to learn a bit about Nokogiri.
Step 2 - Extracting the links I need with Nokogiri
Nokogiri is a very popular gem for Ruby for parsing raw HTML (and XML) data. (It has 4,103 stars and 567 forks at the time I’m writing this, which, for those of you who aren’t GitHub users, is a sign of popularity.) This data is different from the JSON data that I was parsing in a previous article, but this information isn’t as readily available in JSON format because it’s pretty static information and easier to present. In searching, I saw no JSON option for obtaining the unique identifiers. Though, truthfully, as a learning tool, using Nokogiri instead of JSON allows me to work on some new skills.
Nokogiri does a lot of marvelous, wonderful things, but what I’m focusing on at this point is the ability to scan through HTML (we developers sometimes call it scraping) and find specific information. When you write HTML you will often apply special formatting to similar content so that it all looks the same and presents a pleasing appearance to the user. This is done through Cascading Style Sheets (CSS) and the application of specific classes, which are blocks of CSS code that allow you to easily apply identical styling to various parts of a page with a simple statement of
class="css-class-name". Nokogiri has the built-in ability to find not only specific elements (links, rows, buttons, etc.) but also classes that are applied throughout the source code you were looking at. With that piece of information, I just needed to find a class that was applied to those Game Center buttons and nowhere else in the page. (I mean, I could deal with other content, but if there is something specific only to the Game Center button, it does save time.)
Looking through this introduction to Nokogiri gave me enough ammunition to:
- Get the raw HTML of the weekly results page linked above
- Use Nokogiri to find a way to isolate the links that had the information I needed.
If you took the time to scroll through that link to the source page data, you would have seen a lot of information crammed together. When you use Nokogiri to process it, it becomes even more crowded (though easier to work with), so the first step was finding the identifier that would give me only the links I wanted.
Reading through the linked primer of Nokogiri, I was able to find all link elements on the page (there are a lot of them) and a quick scan through them found that all the links to the Game Center for a specific game had a specific class assigned to them. (This is unsurprising since they have formatting not seen anywhere else on the page). Thus, I could isolate the link to each Game Center by using Nokogiri and knowing that specific class.
RawData.css('a.gc-btn').map { |link| link["href"] } would isolate the web address of every Game Center in a given week (css returns a set of matching links, so the href has to be read from each one). This is great information to have, but I still need to do a little more work to drill down, because there is only one piece of information I need to extract from that entire web address, and that means working with my nemesis, regular expressions.
Step 3 - Extracting the Game Identifier from the links using regular expressions
The link to the Game Center looks like this:
http://www.nfl.com/gamecenter/2016090800/2016/REG1/panthers@broncos?icampaign=GC_schedule_rr. I only need that long string of numbers in the middle: 2016090800. As stated in part 1, this is the unique identifier for the game that would allow me access to the box score, and as stated earlier in this article, having to manually extract it for up to 16 games every week is a time consuming and numbing process that could cause a project to be less interesting. Not to mention that if I want any historical data, it would take a very long time to manually search through all the pages to get the information, and that just isn’t going to be productive. So, even though I’ve isolated all the links, what I really want is that very specific number, and that’s where regular expressions come in.
If you don’t know what regular expressions are, consider yourself lucky that you’ve never had to learn about them, but I will give a brief explanation. Basically, regular expressions are a way to find patterns within text. If you wish to know more, Wikipedia always has more. I have a long history with regular expressions, most of it frustrating, but I have over time become at least comfortable enough with them that I can use them when I need to. There is a great website called Rubular where you can put in a test string (like the link above, from which I need the unique id) and then construct the regular expression to match the pattern you need. After a little experimenting, I came up with this regular expression:
/\/\d+\//, which successfully isolates
/2016090800/ from the entire link. Now, this is almost exactly what I need, but those pesky forward slashes are in the way. (And they are forward slashes: /. The other one, \, is a backslash, which the regular expression uses to escape them.) Fortunately for me, Ruby has a nice built-in method that allows me to easily deal with this issue.
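To see the match in action outside of Rubular, here is a minimal sketch using only core Ruby. The variable names are mine; the link and the pattern are the ones shown above.

```ruby
link = "http://www.nfl.com/gamecenter/2016090800/2016/REG1/panthers@broncos?icampaign=GC_schedule_rr"

# String#[] with a regular expression returns the first matching substring,
# or nil when nothing matches. The backslashes escape the forward slashes.
matched = link[/\/\d+\//]

puts matched
```

The // after http: is not picked up because \d+ requires at least one digit between the two slashes, so the first match is the game id with its surrounding slashes.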
gsub is a method built in to Ruby that will search a supplied string for a pattern (either plain text or a regular expression) and allow you to manipulate what it finds if you want to. There are lots of options, as you’ll see if you click the link to the guide, but the one I’m dealing with is replacement. Using gsub, you can replace something that exists with nothing by using "" as the replacement. Hence
.gsub("/","") will replace every instance of the forward slash with what’s between the two quotation marks, i.e. nothing. So if I apply that gsub statement to my regular expression search I will successfully isolate 2016090800 as a string, which is more than enough to then plug it into another method that I build on my own to download the box score information for that game.
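Putting the match and the gsub together looks roughly like this. It is a sketch with variable names of my own choosing; the capture-group variant at the end is an alternative I am adding for comparison, not something the steps above depend on.

```ruby
link = "http://www.nfl.com/gamecenter/2016090800/2016/REG1/panthers@broncos?icampaign=GC_schedule_rr"

# Match the slash-delimited digits, then replace each forward slash with
# an empty string, leaving only the digits.
game_id = link[/\/\d+\//].gsub("/", "")

# Alternative: wrap the digits in a capture group and ask String#[] for
# group 1, which skips the gsub step entirely.
game_id_from_capture = link[/\/(\d+)\//, 1]

puts game_id
puts game_id_from_capture
```

Both approaches give the id as a string. One caveat: if a link contains no match, link[...] returns nil and the chained gsub call raises an error, while the capture-group form simply returns nil.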
Of course, this is still just a small step in the development of the project, but a vital one that will allow more complex programming to be done later.
Hopefully, I’ll get there soon and you’ll read about it.