A while back I wrote some code to download climate data from the Government of Canada’s historical climate/weather data website for one of our students. In May this year (2016) the Government of Canada changed their website a little: the URL of the API that responds to data requests changed, as did some of the GET parameters. While fixing those functions I also noticed that the original code only downloaded hourly data, yet not all useful weather variables are recorded hourly; precipitation, for example, is only available in the daily and monthly data formats. This post updates the earlier one, explaining what changed and how the code has been updated. As an added benefit, the functions can now download daily and monthly data files as well as the hourly files the original could handle.
The genURLS() function now has an extra argument timeframe which allows you to select which type of data to download, defaulting to hourly data:
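The canonical version of the function lives in the gist linked at the end of the post; a minimal sketch of what it does looks something like the following. The exact endpoint, GET parameters, and the shape of the returned list are my assumptions about the bulk-data service at the time of writing, so check them against a real request before relying on this.

```r
genURLS <- function(id, start, end,
                    timeframe = c("hourly", "daily", "monthly")) {
    years <- seq(start, end, by = 1)
    nyears <- length(years)
    timeframe <- match.arg(timeframe)
    if (timeframe == "hourly") {
        ## hourly data come one month per file
        years  <- rep(years, each = 12)
        months <- rep(1:12, times = nyears)
    } else if (timeframe == "daily") {
        ## daily data come one year per file; the month is ignored
        months <- 1
    } else {
        ## monthly data come as a single file for the whole record
        years  <- start
        months <- 1
    }
    ## the bulk-data endpoint encodes the timeframe as 1, 2, or 3
    tf <- match(timeframe, c("hourly", "daily", "monthly"))
    urls <- paste0("http://climate.weather.gc.ca/climate_data/bulk_data_e.html",
                   "?format=csv",
                   "&stationID=", id,
                   "&Year=", years,
                   "&Month=", months,
                   "&Day=14",
                   "&timeframe=", tf,
                   "&submit=Download+Data")
    list(urls = urls, years = years, months = months)
}
```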
If we wanted all the data for 2014 for the Regina RCS station, then we could generate the URLs we’d need to visit as follows:
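The station ID below is only a placeholder; look the real Regina RCS ID up on the climate site before running this.

```r
## placeholder ID; substitute the real Regina RCS station ID
regina <- genURLS(28011, 2014, 2014, timeframe = "hourly")
length(regina$urls)   # 12 URLs, one per month of hourly data
```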
The function that downloads and reads in the data is getData().
The main infelicity is that you have to supply getData() with a data frame containing the station IDs and the start and end years for the data you want to collect. This suited my needs, as we wanted to grab data from 10 stations with different start and end years as required to track station movements. It’s not as convenient if you only want to grab data for a single station, however.
getData() gains the same timeframe argument as genURLS(). In addition, to handle the (quite frankly odd!) choice of characters used in the various flag columns, I now convert the file encoding from latin1 to UTF-8 using the iconv() function. Whether this works portably remains to be seen; I’m not that familiar with file encodings. If it doesn’t work, an option would be to determine the user’s locale and convert to that native encoding instead.
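Again, the gist has the canonical version; a compressed sketch of the main moving parts is below. The number of metadata lines to skip, the cache file naming, and the return of one data frame (or one vector of failed URLs) per station are assumptions here, so treat this as an illustration rather than the definitive implementation.

```r
getData <- function(stations, folder,
                    timeframe = c("hourly", "daily", "monthly"),
                    verbose = TRUE) {
    timeframe <- match.arg(timeframe)
    ## metadata lines to skip before the data proper; these counts are
    ## assumptions -- check a downloaded CSV and adjust to match
    skip <- switch(timeframe, hourly = 15, daily = 24, monthly = 17)
    dir.create(folder, showWarnings = FALSE, recursive = TRUE)
    out <- vector(mode = "list", length = nrow(stations))
    for (i in seq_len(nrow(stations))) {
        curr <- stations[i, ]
        URLS <- genURLS(curr$StationID, curr$start, curr$end,
                        timeframe = timeframe)
        files <- file.path(folder,
                           paste0(curr$StationID, "-", URLS$years, "-",
                                  URLS$months, "-", timeframe, ".csv"))
        sets <- vector(mode = "list", length = length(URLS$urls))
        if (verbose) {
            pb <- txtProgressBar(max = length(URLS$urls), style = 3)
        }
        for (j in seq_along(URLS$urls)) {
            if (!file.exists(files[j])) {   # use the local cache if we have it
                dl <- try(download.file(URLS$urls[j], destfile = files[j],
                                        quiet = TRUE), silent = TRUE)
                if (inherits(dl, "try-error")) {
                    unlink(files[j])              # drop any partial download
                    sets[[j]] <- URLS$urls[j]     # record the failed URL
                    if (verbose) setTxtProgressBar(pb, j)
                    next
                }
            }
            ## the flag columns use latin1 characters; convert to UTF-8
            lines <- iconv(readLines(files[j]), from = "latin1", to = "UTF-8")
            sets[[j]] <- read.csv(text = lines, skip = skip,
                                  stringsAsFactors = FALSE)
            if (verbose) setTxtProgressBar(pb, j)
        }
        if (verbose) close(pb)
        failed <- vapply(sets, is.character, logical(1))
        out[[i]] <- if (any(failed)) {
            unlist(sets[failed])      # hand back the URLs that failed
        } else {
            do.call("rbind", sets)    # one data frame per station
        }
    }
    out
}
```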
One thing you’ll notice quickly if you start downloading data with this function is that the web script the Government of Canada uses on their climate website will quite happily generate a fully formed file containing no actual data (but with all the headers, hourly time stamps, etc.) if you ask for data outside the window of observations for a given station. There are no errors, just lots of mostly empty files containing nothing beyond the header and labels.
One other thing to note is that getData() returns the downloaded data as a list; no attempt is made to flatten the individual components into a single large data frame. That’s because the function allows for failed data downloads (or reads) and records the failed URL in place of the data. This gives you a chance to check those URLs manually to see what the problem might be before re-running the job, which, because all the CSVs were saved, will run very quickly from that local cache.
The use of data.frames internally is showing signs of being a bit of a bottleneck performance-wise; rbind()-ing many stations or files of data takes a long time. I plan on changing the code to use tbl_dfs now that Hadley has moved that functionality to the tibble package. I am reliably informed that bind_rows() is much quicker.
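For the curious, the eventual change amounts to swapping something like the first call below for the second (the toy list here is just to make the snippet self-contained):

```r
list_of_dfs <- list(data.frame(x = 1:2), data.frame(x = 3:4))   # toy example

## base R; gets slow as the number of data frames grows
flat1 <- do.call("rbind", list_of_dfs)

## dplyr's bind_rows() is reported to scale much better
flat2 <- dplyr::bind_rows(list_of_dfs)
```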
The eagle-eyed among you will notice the dreaded stringsAsFactors = FALSE in the definition of getData(). I’m beginning to see why people who work with messy data find the default stringsAsFactors = TRUE downright abhorrent!
To see getData() in action, we’ll run a quick job, downloading the 2014 data for two stations
Regina INTL A (51441)
Indian Head CDA (2925)
First we create a data frame of station information
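The column names are whatever your version of getData() expects; the sketch above looks for StationID, start, and end.

```r
stations <- data.frame(StationID = c(51441, 2925),
                       start     = rep(2014, 2),
                       end       = rep(2014, 2))
```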
Then we pass this to getData() with the path to the folder we wish to cache downloaded CSVs in
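Assuming a local folder "./csv" for the cache (any path will do), the call looks something like this; verbose = FALSE suppresses the progress bar, as explained below.

```r
met <- getData(stations, folder = "./csv", timeframe = "hourly",
               verbose = FALSE)
```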
This will take a few minutes to run, even for just 24 files, as the site is not the quickest to respond to requests (or perhaps they are now throttling my workstation’s IP?). Note I turned off the printing of the progress bar here, only because this doesn’t play nicely with knitr’s capturing of the output. In real use, you’ll want to leave the progress bar on (which it is by default) so you see how long you have to wait till the job is done.
Once this has finished, we can quickly determine if there were any failures
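With getData() as sketched above, a failed component holds the offending URL(s) as a character vector rather than a data frame, so a simple type check does the trick:

```r
failed <- vapply(met, is.character, logical(1))
any(failed)
```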
If any had failed, the failed logical vector could be used to index into met to extract the URLs that encountered problems, e.g.
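Something along these lines would pull those out:

```r
unlist(met[failed])   # the URLs that need a closer look
```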
If there were no problems, then the components of met can be bound into a data frame using rbind()
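In other words, flatten the list in one go:

```r
met <- do.call("rbind", met)
```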
The data now looks like this
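A quick peek at the first few rows and the column names gives a flavour of it:

```r
head(met)
names(met)
```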
Yep, still a bit of a mess; some post-processing is required if you want tidy names etc. The column names are hardcoded but retain the messy names given to them by the Government of Canada’s webmaster. Cleaning up afterwards remains advised.
A final note, I could have run this over all the cores in my workstation or even on all the computers in my small computer cluster but I didn’t, instead choosing to run on a single core overnight to get the data we needed. Please be a good netizen if you do use the functions I’ve discussed here as other people will no doubt want to access the Government of Canada’s website. Don’t flood the site with requests!
If you have any suggestions for improvements or changes, let me know in the comments. The latest versions of the genURLS() and getData() functions can be found in this GitHub gist.