Generating metropolitan subsets of Census data with R and tigris
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Need help working with Census data in your project? Contact me at [email protected] to discuss consulting support or a training workshop!
Commonly, studies that use US Census data focus on topics at the scale of the metropolitan area. However, subsetting Census geographic data by metropolitan area is not always straightforward. Such a workflow for Census tracts might look something like:
- Manually downloading Census tract shapefiles (often available by state);
- Looking up the counties in a given metropolitan area, along with their FIPS codes;
- Subsetting the data by those FIPS codes.
- Repeating the process for each state included in the metropolitan area and then merging the result (if tracts are obtained by state).
This task – which can be tedious – is well-suited for R and the tigris package. R, by way of the sf package, has powerful functionality for spatial subsetting. Further, with a little custom code, we can set up a framework for rapidly retrieving Census tracts for any metropolitan area we want. To get started, let’s load the required packages and set options to cache data from tigris and load the data as sf objects.
library(tigris) library(sf) library(tidyverse) options(tigris_class = "sf") options(tigris_use_cache = TRUE)
Let’s say we want to obtain Census tracts for the Portland, Oregon metropolitan area, which includes areas in both Oregon and Washington. We can fetch Census tracts for Oregon and Washington and combine them with the rbind_tigris
function.
orwa <- rbind_tigris( tracts("OR", cb = TRUE), tracts("WA", cb = TRUE) ) ggplot(orwa) + geom_sf()
We can then subset these tracts spatially by locating the boundary of the Portland metropolitan area available in the core_based_statistical_areas
function. Use the View
function in RStudio and browse the dataset if you are unsure of how to format the string used to filter the data.
cb <- core_based_statistical_areas(cb = TRUE) pdx <- filter(cb, grepl("Portland-Vancouver", NAME)) ggplot(pdx) + geom_sf()
The simplest way to do spatial subsetting in sf is by indexing the data you’d like to subset by the spatial overlay. This will return all tracts that intersect the metropolitan area.
p1 <- orwa[pdx,] ggplot() + geom_sf(data = p1) + geom_sf(data = pdx, fill = NA, color = "red")
We notice a problem here - we’ve returned all tracts within the metropolitan area, but also those outside it that touch the metropolitan area’s boundary. Fortunately, sf includes a suite of methods for spatial overlay. We’ll use the st_within
method to identify those tracts within the boundary of the metropolitan area. Further, as the Census Bureau designs its shapefiles to align cleanly, we shouldn’t have any spatial overlay problems here.
w1 <- st_within(orwa, pdx) print(length(w1)) ## [1] 2282 print(w1[1:5]) ## [[1]] ## integer(0) ## ## [[2]] ## integer(0) ## ## [[3]] ## integer(0) ## ## [[4]] ## integer(0) ## ## [[5]] ## [1] 1
We see that the object w1
is a list with the same length as our orwa
data frame. Values either are length-0 integers - representing tracts that are not within the metro area - or 1
, representing tracts that are. In turn, we can use purrr to convert this to a logical vector and subset our data accordingly.
w2 <- map_lgl(w1, function(x) { if (length(x) == 1) { return(TRUE) } else { return(FALSE) } }) p2 <- orwa[w2,] ggplot() + geom_sf(data = p2) + geom_sf(data = pdx, fill = NA, color = "red")
Perfect! Now, let’s say you want to make this extensible to other metropolitan areas on-demand. We’ll need a function that does the following:
- Takes a metropolitan area as input, and detects the states in which the metro area is located;
- Retrieves tracts for those states;
- Identifies the tracts located within the metro area’s boundary.
Here’s how to do it:
metro_tracts <- function(metro_name) { # First, identify which states intersect the metro area using the # `states` function in tigris st <- states(cb = TRUE) cb <- core_based_statistical_areas(cb = TRUE) metro <- filter(cb, grepl(metro_name, NAME)) stcodes <- st[metro,]$STATEFP # Then, fetch the tracts, using rbind_tigris if there is more # than one state if (length(stcodes) > 1) { tr <- rbind_tigris( map(stcodes, function(x) { tracts(x, cb = TRUE) }) ) } else { tr <- tracts(x, cb = TRUE) } # Now, find out which tracts are within the metro area within <- st_within(tr, metro) within_lgl <- map_lgl(within, function(x) { if (length(x) == 1) { return(TRUE) } else { return(FALSE) } }) # Finally, subset and return the output output <- tr[within_lgl,] return(output) }
Let’s try it out! Here’s Chicago:
chi <- metro_tracts("Chicago") ggplot(chi) + geom_sf()
And here’s Boston:
bos <- metro_tracts("Boston") ggplot(bos) + geom_sf()
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.