Those who have perused my previous publications have read about my attempt to analyze NBA game data for my own purposes and pleasure. While the project is no longer as relevant as it once was, the idea, from a learning point of view, is still interesting enough to me to pursue, and as you may remember, holding my interest is always a risk when I start building a project.

To this point, I’ve published two articles (the model testing ones: part 1 and part 2) about building the foundation of this project and then retroactively testing it. It’s been three months since I wrote those articles, and I haven’t really worked on the project since, because I was at another sticking point that I couldn’t see my way around in terms of testing.

In my mind, when you’re writing your own code, you really should test everything. Even the actions/methods/views you write every time, like your basic CRUD, should be tested for thoroughness and to make sure you don’t miss anything. When you’re so used to doing things that they seem basic, you often miss a simple step. Following proper TDD protocol means not only that you won’t miss those basic steps, but also that as your application gets more complicated (and requires deeper testing), you’re making sure your basics work, and continue to work, no matter what you put on top of them.

The tests I wrote about earlier in the two model testing articles only tested the insertion of the data. Given the proper raw data, did my code put the right data in the right place, while still making sure (where I’d written checks against it) that duplicate data wasn’t entered? All those tests passed, but really, what good is data if you can’t access it and manipulate it the way you want? So, that’s the next step: testing the queries.

I’m going to be geeky here for a minute, like real geeky, not the cool trendy geeky. I love writing queries. In SQL (Structured Query Language), queries are how you access the data, select the data, and present it exactly as you wish. Sometimes writing a query can be pretty easy; other times it can be a chore. The chore can come from the fact that you’re working on a large data set that you’ve never seen before and have to figure out how it all fits together, or perhaps it’s poorly organized data and you have to clean it up first, or perhaps it’s data you organized yourself but the query is complex enough that it takes you a while to figure out.

My favorite query I ever built started with an Excel spreadsheet that had the daily results of every baseball game played in a given season, and I was asked to write a query to get the daily standings for every division. It was probably over a decade ago that I was asked to do it, and it took me 2-3 hours to get it right, but I felt pretty damn awesome when I finally finished it. (Work like that is part of what led me to the NBA project; there weren’t a lot of people publicly working on NBA data the way they were on MLB data at the time, plus I just enjoy watching basketball.)

Rails, as it always does, understands the importance of query writing in almost all applications, so it has built-in functionality that lets you write code that translates into the proper SQL query. This is useful not only because it allows you to write code that will work on many RDBMSs (Relational Database Management Systems; you’ve probably heard of at least one of MySQL, Oracle, PostgreSQL, or SQL Server, all based on the ‘rules’ of SQL but each with its own quirks) without altering your source code, but also because it is built-in insurance against a type of attack called SQL injection. If you write Rails code that translates into SQL, then anyone trying to insert raw SQL into your system is going to have a problem. (As always with Rails, you can write straight SQL and have Rails use it properly if you want, and sometimes you have to, but for now I’m trying to use the built-in functions.)
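
To illustrate what injection actually looks like, here’s a hedged, standalone sketch (the `games` table and `home_team` column are made up, and the hand-rolled `quote` method is only a stand-in for the placeholder binding ActiveRecord does for you with `Model.where("home_team = ?", input)`):

```ruby
# Hypothetical table/column names; real Rails code would never build SQL by hand.
def unsafe_query(team)
  # User input dropped straight into the SQL string.
  "SELECT * FROM games WHERE home_team = '#{team}'"
end

def quote(value)
  # Stand-in for placeholder binding: double any embedded single quotes so
  # the whole value stays inside one string literal.
  "'#{value.gsub("'", "''")}'"
end

def safe_query(team)
  "SELECT * FROM games WHERE home_team = #{quote(team)}"
end

payload = "x'; DROP TABLE games; --"
puts unsafe_query(payload) # the quote in the payload ends the literal early
puts safe_query(payload)   # the payload stays quoted as harmless data
```

In the unsafe version, the attacker’s quote closes the string literal and the rest of the payload runs as SQL; in the quoted version it’s all just data.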

So that’s where I am now - I want to write the queries, simple and complex, and test them, but that is where the sticking point comes in.

Test Data in Rails

Rails makes testing your web site, including user input and interaction, quite easy and clean from the beginning. When you run the initial command to create your base Rails application, three environments are established for the developer to work with. There’s the production environment: the live version when you put your app up on the web. There’s the development environment, where you can write and run your code to see what it looks like and how it functions without changing the production environment. Finally, there’s the test environment. My understanding of the test environment is limited; I don’t know everything it does, but as I understand it, it lets you ‘run’ your application and simulate web interaction. You can click links, you can fill in forms, you can change records; you can test almost any functionality you want as long as you know how. One of the tricks about this is that each environment has its own database. The databases have the exact same structure, but changes to the development database don’t change the production database, and the test database doesn’t impact either of the other two.
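
The one-database-per-environment setup lives in config/database.yml; a minimal sketch for this project (database names here follow the convention Rails would generate for an app called nba):

```yaml
# config/database.yml (illustrative, not my actual file)
development:
  adapter: postgresql
  database: nba_development

test:
  adapter: postgresql
  database: nba_test

production:
  adapter: postgresql
  database: nba_production
```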

The test database also has one other, usually helpful, function. So as not to pollute future tests, the test database resets after every test run. By resetting, I mean it dumps all the data contained in or created during the tests being run and goes back to empty. Most of the time this isn’t a big deal, though it also means it dumps seed data, and if you’ve read other articles here you may have happened upon my struggles with that issue. This dumping of the data can also cause a problem if, like me, you want to test against a large set of data that you can’t quite figure out how to seed properly.

In some random research for this idea, before I was really ready to work on it, I did discover that you can write your seed file so that it only seeds in specific (like test) environments. This solves one problem, but a second problem presents itself. I want to thoroughly test a variety of queries, both simple and complex, and have enough data to make sure everything is working; I figured this would take a month’s worth of games. Running the methods that go out to the web and populate the data would make tests take a really long time, and possibly wouldn’t even run properly within the test environment (see the articles referenced above for some of the issues involved). The seed file does contain a method for populating data from the previous four seasons, and it could be altered for a shorter time period, but still, I just don’t want to think about how long that would take.
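
A minimal sketch of what that environment gate in db/seeds.rb can look like (the helper name `load_sample_games` and the dates are made up; a real seed file would check `Rails.env`, while here `ENV['RAILS_ENV']` stands in so the snippet runs on its own):

```ruby
# Stand-in for Rails.env so this runs outside a Rails app.
env = ENV.fetch('RAILS_ENV', 'development')

def load_sample_games(from, to)
  # Hypothetical: in the real seed file this would fetch and insert game data.
  "would seed games from #{from} to #{to}"
end

if env == 'test'
  puts load_sample_games('2015-10-27', '2015-11-30')
else
  puts 'skipping test-only seed data'
end
```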

(If you’re wondering why I don’t just use the Faker gem, for those of you who know what it is: I want to write queries that calculate a variety of specific numbers from simple and more complex calculations, and I can only know whether they work by knowing in advance what their outcome should be.)

As I sat down today to think (and write) about this, I realized that I hadn’t considered the other way I might be able to do this. Most RDBMSs have an export function of one sort or another. If I loaded the test data into the development database and then exported it, perhaps I could load it into the test database each time I needed it. It might still be time consuming, but it would have to be less time consuming than making all the web requests a month’s worth of games would entail, so that’s what I decided to try.

Postgres and Saving Test Data

As I mentioned earlier, there are a lot of RDBMSs out there, each with its own quirks. For reasons of job experience (and what existed at the time), I’m most fluent in MySQL and SQL Server. However, a lot of Rails hosting is built with PostgreSQL as the database back end. For instance, this website is currently hosted on Heroku, which does a lot of the setup work for you, but you have to use Postgres. (Other hosts will let you set up the entire thing soup to nuts any way you want, but I’m not really ready for that just yet; Heroku works for me for now.) In that vein, I chose Postgres as the database in my development and test environments when I started working seriously on this project, so that it would just work from the start. (By default, Rails uses a database called SQLite, which, as the name implies, is a light version of a SQL RDBMS. It’s perfectly acceptable for development and testing but really doesn’t work for most heavy-duty apps, though I’m told iPhone apps use it quite a bit.)

This database choice makes setting up your application a little more laborious, but once you get it done it doesn’t really make much of a difference if you’re using Rails to write all your queries. There are some slight idiosyncrasies that could come into play if you write straight SQL queries (which I’m trying to avoid), and of course, the way you execute certain, shall we say ‘system specific’, tasks is going to be specific to the platform. Each of these systems, though, usually has a way to export and import data. And it’s this functionality, which I finally remembered existed today, that I want to exploit as a way to have test data readily available.

My Idea to Get My Test Data

  1. Download the first month’s games of the 2015/2016 NBA season (well, the few games in October and all the games in November).
  2. Use Postgres functionality to export the data I need from the relevant tables in nba_development and load it into the nba_test database.
  3. Create a helper method in RSpec that will let me load the data only in the test files that actually need it.
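
For step 3, the rough shape I have in mind is something like this sketch (the module, method, and file names are assumptions, not working code from the project yet):

```ruby
# Sketch of an RSpec support helper (spec/support/test_data_helper.rb)
# for loading the Postgres dump on demand.
module TestDataHelper
  DUMP_FILE = 'spec/dumps/test.sql'.freeze

  # The psql argv, built separately so it can be inspected on its own.
  def self.load_command(db = 'nba_test')
    ['psql', '-d', db, '-f', DUMP_FILE]
  end

  def load_nba_dump(db = 'nba_test')
    system(*TestDataHelper.load_command(db)) or raise "psql load into #{db} failed"
  end
end

p TestDataHelper.load_command
```

A spec that needs the data would include the module and call load_nba_dump in a before block.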

Obviously this has to be tested as well (a month’s worth of games will take a long time to download, so it’s a ‘set up before bed’ kind of activity, and you want everything to work before that), so I’m going to try first with only two days’ worth of data. Since my experience working directly with PostgreSQL (instead of through Rails) is very limited, this also requires me to research and play with some of that functionality until I get it right.

Downloading the Games

As stated above, my seed file already has code that lets users download a few seasons’ worth of games if they want to. It’s a matter of adding a line with a different start and end date. A quick edit to import a couple of days’ games, a run of the seed file, and we should be good to go.

Except sadly, we weren’t that good to go.

I won’t get into the complications of it, but when your date is a string formatted ‘YYYY-MM-DD’, as it needs to be to fetch the data, you can’t just add 1 to get the next date, which I really should have known. A quick thought about how to fix it, a test to see if the plan worked, and I was ready to go.
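
One way to handle it (a sketch, not the project’s actual code): parse the string into a Date, step with next_day, and format back out with strftime.

```ruby
require 'date'

# Walk from one 'YYYY-MM-DD' string to another, one day at a time.
def each_game_date(start_str, end_str)
  date = Date.parse(start_str)
  finish = Date.parse(end_str)
  dates = []
  while date <= finish
    dates << date.strftime('%Y-%m-%d')
    date = date.next_day
  end
  dates
end

p each_game_date('2015-10-31', '2015-11-02')
# => ["2015-10-31", "2015-11-01", "2015-11-02"]
```

Date handles the month rollover that naive string math gets wrong.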

The Joys of the pg_dump

Nope, that’s not a typo, folks who don’t know Postgres. The command for exporting data out of Postgres is pg_dump, entered at the command line. (I realize there’s a long history of the word ‘dump’ in data circles, but really, was pg_export just not allowed?) One of the great things about all this open source stuff is that it usually has some pretty good documentation (and/or tutorials), and Postgres is a fine example. The entire documentation for its functionality is available online, including that for our buddy pg_dump.

As with most command line tools, pg_dump has a variety of options and machinations a user can go through to achieve their desired goal. I won’t go over all the trial and error required, but as my goal is just to get data from certain tables to populate my test database when needed, I came up with (and successfully tested) this:

pg_dump -at games -t participants -t statistics -t players nba_development > spec/dumps/test.sql

That one command pulls all the downloaded data I need to test. The -a flag dumps data only (no schema, which the test database already has), and each -t flag restricts the dump to a named table. (A thought occurs to me that if I split the dump into multiple files, I could cut load time when, for instance, I’m only testing interactions between games and participants, but for now we’ll just go with this.)
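
Sketching that per-table thought in Ruby (same table and database names as above; this just builds the command strings rather than running them):

```ruby
TABLES = %w[games participants statistics players].freeze

# One data-only dump file per table, so a spec can load only what it needs.
def per_table_dump_commands(db = 'nba_development')
  TABLES.map { |t| "pg_dump -at #{t} #{db} > spec/dumps/#{t}.sql" }
end

puts per_table_dump_commands
```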

So that was done, and now I had to figure out how to ‘reverse the dump’, which I hoped would be pretty easy with another piece of built-in Postgres functionality.

The Fun of pg_restore

Running through all the options of pg_dump’s counterpart, pg_restore, I thought I had happened upon the proper way to load the dumped data into the test database via the command line:

pg_restore -d nba_test spec/dumps/test.sql

This threw an error, but it threw a pretty useful error, because it politely suggested that I use psql instead. (It turns out pg_restore only handles pg_dump’s archive formats, not a plain SQL file like the one I had created.) Psql is a pretty versatile little quartet of letters, I’m coming to find. A quick search through the tutorials found this early tutorial that provided the proper way to load the data from my export file:

psql -d nba_test -f spec/dumps/test.sql

And that, my friends, did the trick. (Here -d names the database to connect to, and -f names the file of SQL to run.) A quick SELECT * on the four relevant tables in the nba_test database showed that all the data was there. That’s also where I’m going to end this post. Step 3, building the helper function, I’ll put into a separate post, as I think I need to start making these posts shorter (and more frequent as well).

Until we meet again, zut alors.