RghcnV3: A new package

[This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers].

It’s been a long journey, and there are some people to thank for helping me along the way: Steve McIntyre, Ron Broberg, Jeff Id, Ryan O’Donnell, RomanM, Nick Stokes, Robert Hijmans, Gabor Grothendieck, Hadley Wickham, David Winsemius, and countless others on the R-help list.

The package is done.

Package: RghcnV3
Type: Package
Title: Global Historical Climate Network Version 3
Version: 1.0
Date: 2011-06-16
Author: Steven Mosher
Maintainer: Steven Mosher <moshersteven@gmail.com>
Depends: R (>= 2.13.0), R.utils, R.oo, R.methodsS3, zoo, raster, sp, rgdal
Suggests:
Description: The Rghcn package provides the core functions required to
  download the GHCN V3 data and process it into temperature anomalies.
  In addition, there are a few core functions required to download and
  create land masks.
License: GPL (>= 2)
URL: http://stevemosher.wordpress.com/
LazyLoad: yes
LazyData: no

Over the course of the last year or so I’ve been learning R and aiming at building a package for working with GHCN data. I have probably written close to 10 different versions of the package, trying to get it down to something that was clear and clean. More importantly, I wanted to leverage the existing open source resources: the existing packages in R. Over the course of that time I’ve played around with S4 classes and R.oo for object oriented design. At some future date I’ll probably switch the design over to OOP, but for now, it’s plain old vanilla R.

Currently, I’m in the final testing of the package which should probably take a day or two and then I’ll be uploading it to CRAN. I may also decide to write a vignette for the package, but I haven’t decided on that yet. Maybe a blog post first.

The key to getting this package down to a few minimal calls is the leveraging of existing R packages. Let me take a minute to talk about them and how I use them. The first package is “zoo”, maintained by Gabor. When you get down to the bottom of GHCN data it is nothing more than a collection of regular time series, that is, regularly spaced time-based data. For that type of data “zoo” is the correct package. If you know anything about GHCN data you know that the storage format is very dense. It basically goes like this:

ID YEAR JAN,FEB,MAR, etc

So that for a given station ID you have entries by year. And every year has monthly data.  The big problem?  Missing years.

4251234500  1900 12 12 13 14 15 15 16 15 13 13 12 14

4251234500  1901 12 12 13 14 15 15 16 15 13 13 12 14

4251234500  1908 12 12 13 14 15 15 16 15 13 13 12 14

4251234500  1909 12 12 13 14 15 15 16 15 13 13 12 14

Were it not for these missing years (1902-1907), it would be an easy matter to turn this n × 12 array into a vector with every month represented. That is, if GHCN data had NAs in all 12 months of missing years, one could merely reshape the n × 12 matrix into a long vector or long time series. However, that’s not the case. So one has to unfold that matrix and insert NA years into the spaces where they are required. It turns out that’s really easy in zoo, with no loops required. What we end up with is a data frame or matrix type object where every column is a station and the rows contain the temperature data. That data structure, a data frame of zoo time series, can be manipulated by all sorts of zoo functions, which are targeted at time series analysis. So I can do “windowing”, filtering, and so on. Transforming my GHCN data into a zoo object thus gets me huge leverage: I can use all the tools of time series analysis from zoo.
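To make the gap-filling step concrete, here is a minimal sketch of one way to do it with zoo. It is not the package's own code: the variable names and the toy values (copied from the example rows above) are assumptions for illustration, and the merge against a zero-width zoo object is simply a common zoo idiom for regularizing a series.

```r
library(zoo)

# Hypothetical toy input mirroring the rows above: one row per year present,
# twelve monthly values per row, with 1902-1907 absent from the file.
years   <- c(1900, 1901, 1908, 1909)
monthly <- matrix(
  rep(c(12, 12, 13, 14, 15, 15, 16, 15, 13, 13, 12, 14), times = length(years)),
  nrow = length(years), byrow = TRUE
)

# Unfold the year x 12 matrix row by row into one long vector and build a
# matching year-month index (January of year y is y, February is y + 1/12, ...).
values <- as.vector(t(monthly))
index  <- as.yearmon(rep(years, each = 12) + rep(0:11, times = length(years)) / 12)
series <- zoo(values, order.by = index)

# Build the complete monthly index and merge against a zero-width zoo object
# that carries it; months with no data come back as NA.
n_months   <- (max(years) - min(years) + 1) * 12
full_index <- as.yearmon(min(years) + (0:(n_months - 1)) / 12)
filled     <- merge(series, zoo(, full_index))

# Once the series is regular, zoo tools such as window() apply directly.
gap <- window(filled, start = as.yearmon(1902), end = as.yearmon(1902 + 11 / 12))
```

Here `filled` covers every month from January 1900 through December 1909, and `gap` picks out the twelve months of 1902, all of which are NA because that year was never in the input.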

The next package is raster, maintained by Robert Hijmans. raster is a package devoted to spatial analysis. Think of a raster as a giant spatial grid. Now extend that grid into a third dimension (time) and you have an idea of the final data structure that temperature records go into. They are time series located in space.

The raster object for that is called a brick. So at a very high level of data abstraction the RghcnV3 package consists of nothing more than mashing a 2-dimensional zoo object into a 3-dimensional raster object. The result is stunningly simple, because once the spatio-temporal temperature records are in a raster brick (lat/lon/time), all our processing can happen through calls on raster objects. So the RghcnV3 package will consist of a very limited number of routines to get data down from the internet, then on to your local disk, then into a zoo object, then into a raster brick. From that point on all your programming happens in the raster package.
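As a rough illustration of the zoo-to-brick idea, here is a small sketch using only standard zoo and raster calls. It is not the package's actual routine: the station coordinates, temperatures, object names, and the 5 x 5 degree grid are all made up for the example.

```r
library(zoo)
library(raster)

# Hypothetical toy inputs: three stations (the first two fall in the same
# 5-degree cell) and three months of made-up temperatures.
stations <- data.frame(
  lon = c(-122.4, -121.9, 2.3),
  lat = c(37.8, 37.3, 48.9)
)
temps <- zoo(
  matrix(c(10, 12, 5,
           11, 13, 6,
           12, 14, 7),
         nrow = 3, byrow = TRUE, dimnames = list(NULL, c("A", "B", "C"))),
  order.by = as.yearmon(1900 + (0:2) / 12)
)

# An empty global 5 x 5 degree grid (the resolution is an arbitrary choice).
grid <- raster(nrows = 36, ncols = 72, xmn = -180, xmx = 180, ymn = -90, ymx = 90)

# One layer per time step: average the stations that fall in each cell.
xy <- as.matrix(stations[, c("lon", "lat")])
layers <- lapply(seq_len(nrow(temps)), function(i) {
  rasterize(xy, grid, field = as.numeric(coredata(temps)[i, ]), fun = mean)
})

# Stack the layers into a lat/lon/time brick.
temp_brick <- brick(stack(layers))
```

Each row of the zoo object (one time step) becomes one layer of the brick, and each column (one station) lands in the grid cell containing its coordinates. From here on, raster functions such as `cellStats(temp_brick, "mean")` or `extract()` operate on all time steps at once.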

As I sit here I know there are a couple little fiddles I would like to do to make the process even slicker, but I’m going to resist that urge for now.

So, a couple of days of testing and then CRAN. The package builds and passes CRAN checks. The manuals are done. I’ll post more in a couple of days.

