You may know that I am a fan of the CivicSpace US ZIP Code Database compiled by Schuyler Erle of Mapping Hacks fame. It contains nearly 10,000 more records than the ZIP Code Tabulation Areas file from the U.S. Census Bureau upon which it is based, so a lot of work has gone into it.
I have been using the database a lot recently to correlate with survey respondents, so I have saved it as an R data.frame. Since others may find it useful, too, I have packaged it into the ‘zipcode’ package now available on CRAN.
Once you load the package, the database is available in the ‘zipcode’ data.frame:
    > library(zipcode)
    > data(zipcode)
    > nrow(zipcode)
    [1] 43191
    > head(zipcode)
        zip       city state latitude longitude timezone  dst
    1 00210 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
    2 00211 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
    3 00212 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
    4 00213 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
    5 00214 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
    6 00215 Portsmouth    NH 43.00590  -71.0132       -5 TRUE
Note that the ‘zip’ column is a string, not an integer, in order to preserve leading zeroes — a sensitive topic for those of us in the Northeast…
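To see why the string type matters, here is a small base-R illustration using three codes that appear in this post. It is only a sketch: converting to integer drops the leading zeroes, and `sprintf()` can pad them back to five digits.

```r
# Three ZIP codes from the examples in this post, stored as strings
zip_chr <- c("02142", "00210", "20210")

# Converting to integer silently discards the leading zeroes
zip_int <- as.integer(zip_chr)

# Zero-padding to width 5 recovers the original strings
restored <- sprintf("%05d", zip_int)

print(zip_int)
print(restored)
```

Padding only works when you already know every value is a five-digit U.S. code, which is one reason to keep the column as a string from the start.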
The package also includes a clean.zipcodes() function to help clean up zip codes in your data. It strips off “ZIP+4” suffixes, attempts to restore missing leading zeroes, and replaces anything with non-digits (like non-U.S. postal codes) with NAs:
    > library(zipcode)
    > data(zipcode)
    > somedata = data.frame(postal = c(2061, "02142", 2043, "20210", "2061-2203", "SW1P 3JX", "210", '02199-1880'))
    > somedata
          postal
    1       2061
    2      02142
    3       2043
    4      20210
    5  2061-2203
    6   SW1P 3JX
    7        210
    8 02199-1880
    > somedata$zip = clean.zipcodes(somedata$postal)
    > somedata
          postal   zip
    1       2061 02061
    2      02142 02142
    3       2043 02043
    4      20210 20210
    5  2061-2203 02061
    6   SW1P 3JX  <NA>
    7        210 00210
    8 02199-1880 02199
    > data(zipcode)
    > somedata = merge(somedata, zipcode, by.x='zip', by.y='zip')
    > somedata
        zip     postal       city state latitude longitude timezone  dst
    1 00210        210 Portsmouth    NH 43.00590 -71.01320       -5 TRUE
    2 02043       2043    Hingham    MA 42.22571 -70.88764       -5 TRUE
    3 02061       2061    Norwell    MA 42.15243 -70.82050       -5 TRUE
    4 02061  2061-2203    Norwell    MA 42.15243 -70.82050       -5 TRUE
    5 02142      02142  Cambridge    MA 42.36230 -71.08412       -5 TRUE
    6 02199 02199-1880     Boston    MA 42.34713 -71.08234       -5 TRUE
    7 20210      20210 Washington    DC 38.89331 -77.01465       -5 TRUE
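For the curious, cleanup of this kind can be approximated in a few lines of base R. The function below is a rough sketch of the three steps described above, not the package's actual code: `clean_zip_sketch()` is a made-up name, and treating anything longer than five digits as NA is an assumption of this sketch.

```r
# Hypothetical helper illustrating the cleanup steps; NOT the source of
# clean.zipcodes() from the zipcode package.
clean_zip_sketch <- function(x) {
  x <- trimws(as.character(x))

  # Drop a "ZIP+4" suffix: everything from the first hyphen onward
  x <- sub("-.*$", "", x)

  # Anything that is not one to five digits becomes NA (assumption)
  x[!grepl("^[0-9]{1,5}$", x)] <- NA_character_

  # Left-pad the remaining codes with zeroes to five characters
  ok <- !is.na(x)
  x[ok] <- paste0(strrep("0", 5 - nchar(x[ok])), x[ok])
  x
}

postal <- c("2061", "02142", "2061-2203", "SW1P 3JX", "210", "02199-1880")
cleaned <- clean_zip_sketch(postal)
print(cleaned)
```

On the sample values this gives the same results as the session above, but the real function may handle edge cases differently.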
Now we wouldn’t be R users if we didn’t try to do something with data, even if it’s just a lookup table of zip codes. So let’s take a look at how they’re distributed by first digit:
    library(zipcode)
    library(ggplot2)
    data(zipcode)

    # the first digit of the zip code identifies its region
    zipcode$region = substr(zipcode$zip, 1, 1)

    g = ggplot(data=zipcode) + geom_point(aes(x=longitude, y=latitude, colour=region))

    # simplify display and limit to the "lower 48"
    # (current ggplot2 expects breaks = NULL, not NA, to suppress breaks)
    g = g + theme_bw() + scale_x_continuous(limits = c(-125,-66), breaks = NULL)
    g = g + scale_y_continuous(limits = c(25,50), breaks = NULL)

    # don't need axis labels
    g = g + labs(x=NULL, y=NULL)

    # draw the plot
    print(g)
If we make the points smaller, cities and interstates are clearly visible, at least once you leave the Northeast Megalopolis:
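One way to shrink the points is the `size` argument of `geom_point()`. The sketch below wraps the plot from this post in a function so the point size can be varied; `zip_map()` and the default of 0.5 are illustrative choices, not taken from the original code, and the three sample rows are copied from the merge output earlier in the post just so the snippet runs without the package.

```r
library(ggplot2)

# Hypothetical helper: the plot from this post with a configurable point size.
# Expects columns zip (character), latitude and longitude.
zip_map <- function(zips, point_size = 0.5) {
  zips$region <- substr(zips$zip, 1, 1)
  ggplot(data = zips) +
    geom_point(aes(x = longitude, y = latitude, colour = region),
               size = point_size) +
    theme_bw() +
    # limit to the "lower 48" and suppress axis breaks
    scale_x_continuous(limits = c(-125, -66), breaks = NULL) +
    scale_y_continuous(limits = c(25, 50), breaks = NULL) +
    labs(x = NULL, y = NULL)
}

# Three rows copied from the merge output earlier in the post
sample_zips <- data.frame(
  zip       = c("02043", "02142", "20210"),
  latitude  = c(42.22571, 42.36230, 38.89331),
  longitude = c(-70.88764, -71.08412, -77.01465),
  stringsAsFactors = FALSE
)

g_small <- zip_map(sample_zips)
```

With the package loaded, `print(zip_map(zipcode))` would draw the full map; try a few values of `point_size` to see what works at your output resolution.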