I am very intrigued by the Infochimps Geo API, so I wanted to play around with it a little and pull the data into R. I’ll start by getting data from the American Community Survey Topline API for a 10 km area around where I live.
First some setup code here. It imports a couple libraries that we’ll need (RJSONIO and ggplot2) then sets up some variables that we’ll use later to construct the REST call into Infochimps Geo.
    library(RJSONIO)
    library(ggplot2)

    api.uri <- "http://api.infochimps.com/"
    acs.topline <- "social/demographics/us_census/topline/search?"
    api.key <- "apikey=xxxxxxxxxx"
    radius <- 10000  # in meters
    lat <- 44.768202
    long <- -91.491603
    columns <- c("geography_name", "median_household_income",
                 "median_housing_value", "avg_household_size")
Note: if you want to use this code, you’ll need to replace the x’s in api.key with your own Infochimps API key.
I am going to pass a latitude, longitude and radius into the API, and it will give me back data for just that geographic area. You can specify the geography in a number of ways (see the API docs for more info).
Next we have to construct the URI to call the API, retrieve the data (which comes in as JSON) and convert the JSON object into an R object using RJSONIO.
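    uri <- paste(api.uri, acs.topline, api.key,
                 "&g.radius=", radius,
                 "&g.latitude=", lat,
                 "&g.longitude=", long, sep = "")
    # warn expects a logical value, so use FALSE rather than the string "F"
    raw.data <- readLines(uri, warn = FALSE)
    # Join the lines into one string before parsing, in case the response spans several lines
    results <- fromJSON(paste(raw.data, collapse = "\n"))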
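As a side note on the URI construction, here is a minimal sketch of building the query string from a named list instead of one long paste() call. The helper build_query is hypothetical (it is not part of the original code or of any package), and the parameter names are simply the ones used in the call above. Nothing here touches the network; it only assembles the string.

```r
# Hypothetical helper: turns a named list into "name=value&name=value".
# URLencode() comes from the utils package that ships with R.
build_query <- function(params) {
  values <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste(paste(names(params), values, sep = "="), collapse = "&")
}

# Same parameters as in the post; the API key is still a placeholder.
params <- list(
  apikey        = "xxxxxxxxxx",
  g.radius      = 10000,
  g.latitude    = 44.768202,
  g.longitude   = -91.491603
)

query <- build_query(params)
demo_uri <- paste0("http://api.infochimps.com/",
                   "social/demographics/us_census/topline/search?",
                   query)
print(demo_uri)
```

Keeping the parameters in a list makes it a little easier to add or drop a parameter without hunting for the right "&" in a long paste() call.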
Next, we need to do some manipulation on the retrieved data to get it into a form that’s easier to deal with. I like using data frames a lot, so I’ll turn it into a data frame by
    ml <- lapply(results$results, function(x) x[columns])
    mm <- matrix(unlist(ml), ncol = length(columns), byrow = TRUE)
    md <- data.frame(mm)
    colnames(md) <- columns
You will now have a data frame in md that looks like

    > head(md)
               geography_name median_household_income median_housing_value
    1 Altoona School District                   48699               151800
    2       Census Tract 5.01                   54498               132100
    3       Census Tract 5.02                   60018               139300
    4       Census Tract 8.02                   66432               186700
    5       Census Tract 8.01                   65833               149900
    6          Census Tract 7                   33365               117500
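If you don’t have an API key handy, the reshaping step can be tried on a small mock of the parsed response. The sketch below assumes the response has a results element holding one named list per geography, as the code above implies; the two records and the variable names (mock_results, wanted, md_demo) are made up for illustration.

```r
# Mock of the parsed JSON: one named list per geography (values are illustrative).
mock_results <- list(results = list(
  list(geography_name = "Tract A", median_household_income = 54498,
       median_housing_value = 132100, total_pop = 1234),
  list(geography_name = "Tract B", median_household_income = 33365,
       median_housing_value = 117500, total_pop = 5678)
))
wanted <- c("geography_name", "median_household_income", "median_housing_value")

# Keep only the wanted fields from each record.
ml <- lapply(mock_results$results, function(x) x[wanted])

# unlist() flattens everything into one vector. Because the names are strings,
# the numbers are coerced to character as well.
flat <- unlist(ml)
print(class(flat))

# Refill row by row: one row per record, one column per field.
mm <- matrix(flat, ncol = length(wanted), byrow = TRUE)
md_demo <- data.frame(mm)
colnames(md_demo) <- wanted
str(md_demo)
```

The class(flat) line shows why the numeric fields do not arrive as numbers: mixing names and values in a single vector forces everything to text before the data frame is built.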
Unfortunately, the columns for median_household_income and median_housing_value are factors instead of numbers at this point. I don’t know an automated way to change them to numeric, so you’ll have to use
    md$median_household_income <- as.numeric(as.character(md$median_household_income))
    md$median_housing_value <- as.numeric(as.character(md$median_housing_value))
by hand to turn them into numbers. You could put that into your script, but it would need to change each time you wanted to get different columns from the data set. If you have an idea of how to automatically convert columns that should be numeric from factors into numbers, I’d be grateful.
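One possible approach, offered as a sketch rather than a definitive answer, is base R’s type.convert(), which guesses a type for each column from its text. The data frame below is a made-up stand-in for md, built with stringsAsFactors = TRUE to reproduce the factor columns described above.

```r
# Stand-in for md: every column starts out as a factor (values are illustrative).
md_demo <- data.frame(
  geography_name          = c("Census Tract 5.01", "Census Tract 7"),
  median_household_income = c("54498", "33365"),
  median_housing_value    = c("132100", "117500"),
  stringsAsFactors = TRUE
)

# Convert each column via its character form. type.convert() turns text that
# looks numeric into numbers; as.is = TRUE leaves other text as character
# instead of turning it back into a factor.
md_demo[] <- lapply(md_demo, function(col) {
  utils::type.convert(as.character(col), as.is = TRUE)
})

str(md_demo)
```

Because the conversion loops over whatever columns are present, it should not need to change when the columns vector does. A column that mixes numbers with stray text will stay as character, so it is worth checking str() afterwards.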
Once you have that done you can now do your favorite analysis on the data like plotting a density graph, creating a linear model, etc.
    qplot(median_household_income, data = md, geom = "density")
    model <- lm(median_housing_value ~ median_household_income, data = md)
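To look at what the linear model produces, here is a small sketch that refits it on just the six rows shown in the head(md) output above and then inspects the result. The names demo, demo_model and new_area are made up, and six rows are far too few to draw any conclusion; the point is only to show the calls.

```r
# The six rows from the head(md) output, entered by hand.
demo <- data.frame(
  median_household_income = c(48699, 54498, 60018, 66432, 65833, 33365),
  median_housing_value    = c(151800, 132100, 139300, 186700, 149900, 117500)
)

demo_model <- lm(median_housing_value ~ median_household_income, data = demo)

# Intercept and slope of the fitted line.
print(coef(demo_model))

# Fitted housing value for a hypothetical area with a 50,000 median income.
new_area <- data.frame(median_household_income = 50000)
print(predict(demo_model, newdata = new_area))
```

summary(demo_model) gives the usual table of standard errors and R-squared if you want more detail.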
Have fun and play around with it. There are quite a few fields in the ACS topline data that you can explore. Here are the fields in the topline data:
"percent_black" "percent_0_to_9_yo" "percent_pacific" "percent_income_50_to_75k" "percent_income_100_to_200k" "percent_asian" "md5id" "avg_household_size" "geo_geometry_type" "percent_house_value_100_to_200k" "percent_race_hispanic" "intersects" "percent_carpool" "percent_housing_owned" "_type" "percent_income_25_to_50k" "percent_income_75_to_100k" "percent_18_to_24_yo" "geography_name" "census_logrecno" "percent_drive_alone" "percent_hs_graduate" "percent_public_trans" "percent_house_value_500_to_1000k" "percent_house_value_lt_50k" "percent_house_value_gtr_1000k" "percent_housing_rented" "fips_id" "percent_10_to_17_yo" "percent_house_value_50_to_100k" "percent_income_lt_25k" "total_pop" "percent_race_nonhispanic" "percent_work_at_home" "percent_female" "percent_65_over_yo" "percent_white" "median_household_income" "percent_mixed_race" "percent_native" "percent_ba_or_above" "percent_less_hs" "percent_50_to_64_yo" "inside" "percent_35_to_49_yo" "percent_25_to_34_yo" "median_housing_value" "percent_income_gtr_200k" "percent_some_college" "percent_other_trans" "percent_male" "percent_house_value_200_to_500k" "coordinates"
I’ve created a Gist with all the code at https://gist.github.com/1208431.
UPDATE:
Thanks to Patrick Hausmann for creating a function that allows me to turn the entire result set received from the API into a data frame without the ugly “columns” variable that I was using. Here is the new function he provided. I have updated the Gist referenced above with a new version of the code.
    ## Special thanks to Patrick Hausmann for the GetData function
    GetData <- function(x) {
      L <- vector(mode = "list", length = x$total)
      a1 <- sapply(x$results, function(z) sapply(z, length))
      field.names <- names(which(apply(a1, 1, function(z) all(z == 1))))
      a2 <- lapply(x$results, function(z) z[names(z) %in% field.names])
      for (i in seq_along(a2)) {
        x1 <- a2[[i]]
        x2 <- data.frame(x1)
        L[[i]] <- x2
      }
      x4 <- do.call(rbind, L)
      return(x4)
    }

    md <- GetData(results)
    str(md)
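The key step in GetData is choosing which fields to keep: only those whose value has length 1 in every record, since anything longer (such as a coordinate pair) would not fit into a single data frame cell. The sketch below isolates that step on a made-up mock response; mock, lens and scalar_fields are illustrative names, and the records are not real API output.

```r
# Mock response: "coordinates" has length 2, the other fields have length 1.
mock <- list(total = 2, results = list(
  list(geography_name = "Tract A", median_household_income = 54498,
       coordinates = c(-91.49, 44.77)),
  list(geography_name = "Tract B", median_household_income = 33365,
       coordinates = c(-91.50, 44.80))
))

# Matrix of field lengths: one row per field, one column per record.
lens <- sapply(mock$results, function(rec) sapply(rec, length))
print(lens)

# Fields that are scalar in every record are the ones GetData keeps.
scalar_fields <- names(which(apply(lens, 1, function(n) all(n == 1))))
print(scalar_fields)
```

This step relies on every record having the same fields in the same order; if records differ, sapply() returns a list rather than a matrix and the apply() call would fail.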
Filed under: Infochimps, R