After a nice diversion in my last article, away from the chronic model building (and the early steps of combing through data), we once again find ourselves where we have to build models, delve into data, and then parse it. This time, however, I’m going to write the article so that it covers both issues.

Finding the Player Information

Getting information for teams is nice. I mean, won’t you want to know the Packers’ point differential at home versus NFC teams at some point? Perhaps you’ll want to know the average number of yards per attempt for every pass thrown in the NFL in a given time frame. These are things I plan to have available within the application when it reaches that critical point of being able to show it to people. However, wouldn’t you also like to know how your favorite team’s 3rd wide receiver is performing? Perhaps you want to see how all the alumni from your university are doing this year in the NFL, because, let’s face it, a lot of you care more about college football than the NFL and I’m not sure I know why (especially if you didn’t attend said college). Ok, enough digressing.

So I need player information, and I need a bit more than what the super duper JSON object I’m going to use for the game information provides. The JSON object provides an id (which is not the same as the id nfl.com displays in its website URLs, which is weird to me) and a player’s abbreviated name. For instance, if I played football and had some statistics, I would be identified as “J. Magee”. This limited information isn’t enough for me, so more ‘spelunking’ into the bowels of NFL.com is required to find more information about players.

Using the id provided by the JSON object (00-0032156 is an example of what it looks like) and pairing it with the right web address information, I get redirected to this page, which is the player information page for the current starting quarterback of the Denver Broncos, Trevor Siemian.

Note, I am not a Denver Broncos fan, nor am I a Northwestern fan (in fact I’m a Wisconsin Badger fan), but Siemian is the first player identified in my test JSON, so don’t read any more into that if you please. Fly Eagles Fly.

Let’s Learn About Trevor Siemian

As I said above, I think a little bit more information than just T. Siemian might be interesting to those who might visit my site in the future, and there is information on the page I would like to use. However, there’s also a lot of information I have no interest in or just don’t wanna worry about. Fortunately, just like when we dealt with getting the basic game information, our friend [Nokogiri](http://www.nokogiri.org) is here to help us.

For reference’s sake, the HTML in this page totals 3,139 lines in my Sublime Text fixture file. However, there’s only a tiny bit of information I want from this page, and while perusing it, I discovered another page I would want information from.

Right next to the picture of that handsome devil Trevor Siemian resides all the information that I want to take and store in my application. Remember how many lines there are in the raw data? Only about 20 of them contain most of the information that I truly want to work with.

So now I knew where all the information was that I wanted, but I had to figure out how to get it. This required some review of the Nokogiri I had already used, a little emulator experimentation to figure out how to get to what I wanted, and my old friend Rubular, because, that’s right, our friend regular expressions is coming to the party again folks, but that’s to be expected in a situation like this I guess.

While playing around with this file in Nokogiri, I noticed that there was a link on the parent page that said draft. If you view the Trevor Siemian draft page, you’ll note that the same personal information I was referring to on the previous page is on this page, as well as some draft information that wouldn’t be bad to store. In the beginning, I thought this was the right page to use to gather player information, but after some work towards that idea, I discovered a couple of issues that meant the original player page was the place to start, and draft information would have to be a separate process:

  1. Undrafted players, like Davonte Lambert, lack the link to a draft page.
  2. I’m not sure how long the NFL has been publishing draft information on nfl.com, but more senior players, like Drew Brees, don’t seem to have a draft page either.

So now I was ready to move forward with the building (and testing) of the functionality, and with something like this you gotta build the model first.

The Player Model

The player model is pretty simple in terms of makeup. At first, there seem to be no relations to build (more on that later). It’s just a matter of creating the attributes for the information I want and doing some simple validations. So based on the above research, this is what I came up with (on the first run through):

  • Name - that’s a string - because of hyphenated names and people with three names, I’m not trying to break it down between first and last name at this time. This could lead to sorting issues later, but at the moment I’m ok with that. Remember, I’m just targeting the MVP for now.
  • The two ids provided by the nfl.com web data. There seems to be the JSON version, as a string (“00-123456” for example), and an integer (2553457) that shows up in the web address. For thoroughness, I’m going to store both until I determine whether or not they are needed long term.
  • Height, birth date, college. This is all good stuff to know and can be helpful in research in the future. I’m not grabbing the weight because, let’s face it, that fluctuates and even athletes lie about it.
  • Draft information: year, round, pick in that round and overall pick. Due to the two issues I mentioned above, this will actually be dealt with separately.

So that’s where I started, and observant fans might notice a piece of information I missed, but I did catch it later.

There’s not a lot to say about the general model creation and testing. It’s pretty straightforward (though shoulda-matchers does not have a built-in method to verify that a date is a date - so there’s that), save for a few things that might seem out of the ordinary to you, or were new for me:

  1. Height is an integer. I know the height on the page says something like 6-3, but I don’t want to store it like that. It feels incorrect to me from a database design standpoint. It was the same issue in my NBA project, so I followed the same methodology. I store height in inches. It’s an easy conversion to the standard ft-in indicator, and if you want to, say, find QBs under 6 feet who threw for 400 yards in a game (no idea if there are any), it’s kind of helpful to just search an integer field for less than 72. This requires a bit of extra work on player creation but I find it worth it.
  2. The draft round pick seems like a simple thing, but I wanted to limit any errors if people were entering data manually in the future. (I don’t see it happening because this whole project is set up to do it automatically, but still.) Therefore, I had to make it a unique value within the draft year and round. Though I had never done this before, it seemed like a natural extension of scope to just put the multiple scope attributes within an array, and it worked like a charm. (And now that I think about it I probably should have scoped the overall pick to the draft year, but I can do that later.)
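
As a rough illustration of why storing height in inches is convenient, here is a minimal plain-Ruby sketch. The `format_height` helper and the sample players are made up for the example and are not part of the application; in the real project the filter would presumably be a database query against the integer column rather than an array `select`.

```ruby
# Hypothetical helper: turn stored inches back into the familiar ft-in form.
def format_height(total_inches)
  feet, inches = total_inches.divmod(12)
  "#{feet}-#{inches}"
end

# Made-up sample data, standing in for rows in the players table.
players = [
  { name: "Player A", height: 75 },
  { name: "Player B", height: 71 },
  { name: "Player C", height: 72 }
]

# Finding everyone under 6 feet is a plain integer comparison against 72.
under_six_feet = players.select { |player| player[:height] < 72 }

under_six_feet.each do |player|
  puts "#{player[:name]}: #{format_height(player[:height])}"
end
```

The same comparison against a "6-3" style string would need parsing on every lookup, which is the extra work the integer column avoids.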

So the basic data table was set up and now I had to figure out how to get the player data.

Populating the basic player data.

Though I initially started with the draft page, those two issues came up so I went back to the original player profile page. Fortunately, both pages are laid out the same for the player information, so the code written is the same for the non-draft information. Draft information just requires a bit more information.

As always, tests were written first to define what would be successful execution of the player population code. It should be noted that during this process I had a few misfires, and I did have to integrate the pry gem to find what was going wrong. I had heard many good things about the pry gem but hadn’t seen a reason to use it until now. The errors the RSpec tests were returning weren’t as helpful when parsing the data. Using pry to stop the execution mid-stream allowed me to examine the process step by step and see what wasn’t working.

In the end, the code below isolated the specific data portions from the player profile to build up the player record:

  new_player_information = source_data.css('div.player-info').css("strong")
  traits = {}
  new_player_information.each do |trait|
    traits[trait.content.downcase] = trait.next.content
  end

This block of code isolates all the traits I spoke about on the player information page, while at the same time allowing me to use only the ones I want (ignoring weight, for instance).

As I mentioned earlier, I also like to store height as inches, and the traits hash would store it as a string, like this: 6-3. Ruby has some excellent built-in tools that make taking that string value and converting it to one single integer pretty easy. Even so, this took a little longer than expected, because my original scan, /\d-\d/, of course didn’t take into account people who had 10 or 11 inches in their height. After a little reminder from looking at the NBA application code mentioned earlier, I was able to isolate the height from the traits hash with the code below:

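  # Pull the first "feet-inches" token (e.g. 6-3 or 5-11) out of the raw trait text.
  height = get_height(traits["height"].scan(/\d-\d+/)[0])

  # Convert a "feet-inches" string into total inches.
  def self.get_height(player_height)
    feet = player_height.split("-")[0].to_i * 12
    inches = player_height.split("-")[1].to_i
    feet + inches
  end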

This code is the end result of refactoring so that double-digit inches in a height are handled properly. Zero-inch heights are represented as 6-0, and they were also tested for thoroughness. This was a lesson for me in edge cases: think through the possible values that can come up and must be accounted for in the code.
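
To see the edge case in isolation, the short sketch below compares the original pattern with the corrected one. The sample strings are made up for illustration (the real trait text may be padded differently), and `height_in_inches` is a standalone stand-in for the `get_height` class method, not the application’s code.

```ruby
# Standalone stand-in for the get_height logic: "feet-inches" to total inches.
def height_in_inches(player_height)
  feet, inches = player_height.split("-").map(&:to_i)
  feet * 12 + inches
end

# Made-up trait values, with some surrounding text as scraped data often has.
samples = [": 6-3 ", ": 5-11 ", ": 6-0 "]

samples.each do |raw|
  narrow = raw.scan(/\d-\d/)[0]   # original pattern: only one digit of inches
  wide   = raw.scan(/\d-\d+/)[0]  # corrected pattern: one or more digits of inches
  puts "#{raw.strip}: narrow=#{height_in_inches(narrow)} wide=#{height_in_inches(wide)}"
end
```

With the narrow pattern, a 5-11 player is read as 5-1 and silently loses ten inches, which is exactly the kind of error that only shows up when the test data includes a double-digit value.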

So, with those tweaks, the basic player information data entry was completed. Now, the draft data, which not all player information pages will have, had to be dealt with.

Getting player draft data.

Since not all players would have draft data, the first step in the process is of course determining if draft data is available for the player. Fortunately, the various links available on a player page are grouped in the same div#id section, and Nokogiri, plus the built-in Ruby method include?, allows a quick check to see if the draft link is available. It did take a little work to make sure I got the links and that they were amenable to the include? method, but in the end, I was able to work it out:

  if source_data.css("div#player-profile-tabs a").to_s.downcase.include?("draft")

So if the draft link is on the page, the application will run the process of getting draft information, and if not, no harm, no foul, and off we go.

The draft information requires another call to the internet, via Nokogiri, to get the draft page information to work with, and then that data must be processed for the player. The process of doing it is quite similar to what was done for the basic player information data. However, the call to the internet and the draft data processing will be split out of the main player creation function so as to keep like things together and unalike things separate.

Getting the draft information out of the draft page is a similar process to other Nokogiri requests I’ve processed previously, but as I mentioned before, there was something I had missed in my initial planning for the player model, and I only realized it as I was building the code for processing the draft information (did you figure it out?). I forgot that every drafted player is drafted by a team, and that information should be stored along with the other draft information.

Obviously, the team that drafted a player is available on the draft information page I’m processing, but it does require a little work. The team information is available purely as text, and while it might seem instinctive to just go with the first word as the city identifier, that won’t work, because certain teams have multiple words in the city (Kansas City for instance) and certain city references might refer to multiple teams (New York and Los Angeles). So instead I wanted to focus on the team nickname. I can easily isolate the last word in the text of the team drafting the player, and while no teams currently have a two-word nickname, it’s not out of the question that it could happen, so even that should be watched. Thus, after isolating the last word in the text string of the drafting team, I used a query with like to get the team:

   Team.find_by('nickname like ?', "%#{teamname}").id

At this point, it’s not really necessary, as all NFL teams have a single-word nickname, but this should handle multi-word team nicknames in the future.
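
The last-word approach can be sketched in plain Ruby. Everything here is illustrative: the `TEAMS` list, the `find_team_by_text` helper, and the sample strings are assumptions, and `end_with?` stands in for the like query with a leading % wildcard (it is case-sensitive, which a SQL like may or may not be, depending on the database).

```ruby
# Made-up stand-in for the teams table (id and nickname only).
TEAMS = [
  { id: 1, nickname: "Chiefs" },
  { id: 2, nickname: "Giants" },
  { id: 3, nickname: "Jets" }
].freeze

# Hypothetical helper: take the last word of the drafting team's text
# and find the team whose nickname ends with that word.
def find_team_by_text(team_text)
  teamname = team_text.split.last
  return nil if teamname.nil? # empty or blank text has no last word
  TEAMS.find { |team| team[:nickname].end_with?(teamname) }
end

team = find_team_by_text("Kansas City Chiefs")
puts team[:id] if team
```

One design note: this sketch returns nil when nothing matches, whereas calling .id directly on the result of find_by raises an error if no team is found, so a nil check before .id may be worth considering.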

The standard process of writing tests and then making them pass was followed and the draft information insertion was completed, and with that, the player processing information was completed.

Much like I needed the team information before I could enter participant information, player information needed a way to be entered before individual player statistics could be added. Now, finally, I am at the point where the player statistics, by game, can be entered, and that’s the information I need in order to reach the MVP state, so that’s what is up next. Unlike the NBA project, though, this will be a slightly more complicated process that might take a bit longer, but I will write about it along the way.

This article might be the most disjointed article I have written so far, and I do apologize. The 100 days of code challenge I am participating in has caused some issues in making sure I stay on track with writing as I go, which is my favorite way to write. As a result, this article was finished long after the code was written, and the mid-stream correction I spoke about, since it was not documented the moment it happened, isn’t clearly spelled out. I apologize for that and hope to be more thorough in the future.