Extract US Census 2010 data with data.table and dplyr

R & Census

5 years ago

[This article was first published on R & Census, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post explains how to extract information from the original dataset of the US 2010 census summary file 1 with urban/rural update, using data.table or dplyr package in R.

Why do we want to work with the original data? You may ask, when there are already R packages, such as UScensus2010, censusapi, and tidycensus, which help user get the data.

The biggest benefit is that you will have full access to all the census 2010 data. The total size of the US 2010 national census summary 1 file with urban/rural update is nearly 150Gb, which is too heavy to be included as dataset in a package. The stand-alone UScensus2010 package only delivers selected demographic data. Others provide an access to United States Census Bureau’s APIs, which also offers selected data. By dealing with the original data directly, you can extract whatever data you want.

In recent years, fast development of packages data.table and dplyr makes it possible for R to process original 2010 census data in a reasonable time frame. In the following example, I retrieved the latitude, longitude, population of all race and of black people living in each census block in the city of South Bend Indiana. The whole process takes about 4 seconds on a four years old laptop. With these data in hand we can plot nicely where black people live down to census block level on a map downloaded from GoogleMap. In a city most census blocks are equivalent to street blocks.

This post is organized in this order: I will first give a brief introduction to the 2010 census summary file 1 with urban/rural update. Then I will show how to extract data step by step using R and data.table package with the example shown above. The codes are verbose in this section as I want to show the details. To clean it up I will then give a more concise codes that use pipe operator from magrittr package. As many users are more familiar with dplyr package, I will also translate the data.table code to dplyr code. The dplyr approach is sufficiently fast for most applications.

File structure of the census 2010 summary 1 file

The 2010 census data with urban/rural update can be downloaded from United States Census Bureau official site. The total size is nearly 150 GB so make sure you have enough hard drive space for it. The data is split into 50 states and DC. Click on a state, for example, Indiana, there is a file named in2010.ur1.zip. Download this file and unzip it to a folder named with the abbreviation “IN”. Do this for all other state and DC.

Inside the folder “IN/” there are a geographic header record file named ingeo2010.ur1 and 48 data files named as in000012010.ur1, in000022010.ur1, …, in000482010.ur1. The first two characters in is the abbreviation of Indiana. The file extension ur1 stands for summary file 1 with urban rural update. The last four numbers 2010 before file extension indicate the census year 2010. The numbers 00001, 00002, …, and 00048 are the sequence of the files. These files are called file 01, file 02, …, and file 48 in the techinical documentation of the summary file 1. The documentation is our dictionary in using these files and we will talk more about it later.

The geographic header record file ingeo2010.ur1 contains the geographic information. It has 331556 lines; each line corresponds to a geographic entity in the census data of Indiana. A geographic entity can be a state, a county, a city, a census tract, a census block etc. A line can be up to 500 characters long, including spaces. Unique to each line is the logical record number, which is 7 characters long, located from 19 to 25 characters in a line. Page 2-8 of the technical documentation lists all geographic field in a line. Each field is assigned with a short reference code. Here are list of frequently use ones:

geographic field	reference	starting position	ending position
logical record number	LOGRECNO	19	25
summary level	SUMLEV	9	11
geographic component	GEOCOMP	12	13
county FIPS	COUNTY	30	32
place FIPS	PLACE	46	50
Metropolitan Area FIPS	CBSA	113	117
internal point (latitude)	INTPTLAT	337	347
internal point (longitude)	INTPTLON	348	359

The data files contain the recorded census data. They are .csv files, that is, each data field is separated by a comma. For example, file 02 contains the population in urban and rural area. Its 5th field is the logical record number that matches the geographic header record file. The 6th field is total population, 7th field urban population, 8th field population in urbanized area, … The properties of each field is listed in the technical documentation. If you want to know the details of, for example file 17, just search in the pdf file of the technical documentation for “file 17” and read through the description.

Extract the 2010 census data step by step

Say we are interested in the population and race in the city of South Bend, Indiana and we want to plot these data on a map at census block level, what should we do? I will use this as an example to show how to extract 2010 census data in detail.

read and extract geographic data

Let’s first take a look at the geographic header record file. We convert it to a data.table of which each row is a line in the file. Each line is a string of characters without separator, so we read it into only one column.

library(data.table)
# change path to data based on your local directories
geo_file <- fread(paste0(path_to_data, "IN/ingeo2010.ur1"), 
                  sep = "\n", 
                  header = FALSE)
# show total number of lines
dim(geo_file)
## [1] 331556      1
# display the first line
head(geo_file, 1)
##                                                                                                                                                                                                                                                                                                                                                                                               V1
## 1: UR1ST IN04000000  00000012318                                                                                                                                                                            92789193658    1537004191Indiana                                                                                   AN  6483802  2795541+39.9030256-086.283950300            00448508

This file has 331556 lines. There are many blank spaces in a line as a geographic entity usually does not have all the geographic fields.

We want get the logical record number, which is used to match those in data file, and latitude and longitude for plot in map. We want to keep the FIPS of place so that later on we can locate South Bend city with its FIPS number. We also want keep summary levels as we only want data at census blocks level. The references of geographic fields are used as column names.

geo <- geo_file[, .(LOGRECNO = as.numeric(substr(V1, 19, 25)),
                    SUMLEV = substr(V1, 9, 11),
                    PLACE = substr(V1, 46, 50),
                    INTPTLAT = as.numeric(substr(V1, 337, 347)),
                    INTPTLON = as.numeric(substr(V1, 348, 359)))]
geo
##         LOGRECNO SUMLEV PLACE INTPTLAT  INTPTLON
##      1:        1    040       39.90303 -86.28395
##      2:        2    040       40.13668 -86.23489
##      3:        3    040       40.43883 -86.21876
##      4:        4    040       40.12984 -86.21021
##      5:        5    040       39.92054 -86.27358
##     ---                                         
## 331552:   331552    970       41.63835 -85.55221
## 331553:   331553    970       41.67686 -87.49205
## 331554:   331554    970       41.15086 -85.66872
## 331555:   331555    970       39.35130 -86.03878
## 331556:   331556    970       41.63419 -87.20929

The column PLACE has a lot of missing values, as many geographic entities does not belong to any place.

Now we can extract geographic data of South Bend. An easy way to find its FIPS number is from its Wikipedia page. South Bend’s FIPS is 18-71000. The first two digits are the FIPS of Indiana, so the unique FIPS in Indiana for South Bend is 71000. Let’s keep only the rows for South Bend (sb).

sb_geo <- geo[PLACE == "71000"]
sb_geo
##       LOGRECNO SUMLEV PLACE INTPTLAT  INTPTLON
##    1:   241621    070 71000 41.61875 -86.24093
##    2:   241622    080 71000 41.63354 -86.22549
##    3:   241623    085 71000 41.63354 -86.22549
##    4:   241624    091 71000 41.63525 -86.21833
##    5:   241625    090 71000 41.63525 -86.21833
##   ---                                         
## 5461:   321382    614 71000 41.72872 -86.26427
## 5462:   324739    624 71000 41.70951 -86.20096
## 5463:   324774    624 71000 41.66799 -86.23096
## 5464:   324810    624 71000 41.66797 -86.27524
## 5465:   324851    624 71000 41.70286 -86.27826

This geographic data includes all summary levels. As we only want the census block level data, we further filter with SUMLEV == "100". At this stage we get the geographic data we want.

sb_block <- sb_geo[SUMLEV == "100"]
sb_block
##       LOGRECNO SUMLEV PLACE INTPTLAT  INTPTLON
##    1:   241626    100 71000 41.63613 -86.21864
##    2:   241627    100 71000 41.63670 -86.21659
##    3:   241628    100 71000 41.63573 -86.22172
##    4:   241629    100 71000 41.63182 -86.22022
##    5:   241630    100 71000 41.63367 -86.22093
##   ---                                         
## 5002:   252566    100 71000 41.69486 -86.25132
## 5003:   252567    100 71000 41.69649 -86.25815
## 5004:   253091    100 71000 41.73058 -86.35508
## 5005:   253092    100 71000 41.73035 -86.35565
## 5006:   253093    100 71000 41.72831 -86.35573

Are we sure we get the correct geographic data? We can plot the latitude and longitude directly on a map. From the map below you will also get a sense of what a census block is; it is basically a street block in urban area. We need internet connection to download map data from Google Map.

map <- get_map("south bend, indiana", zoom = 13)
ggmap(map) +
    geom_point(data = sb_block, aes(INTPTLON, INTPTLAT), color = "red", alpha = 0.3)

read and extract population and race data

From the technical documentation we know that population and race information is in file 03. So let’s read this file.

f03 <- fread(paste0(path_to_data, "IN/in000032010.ur1"), header = FALSE)
## 
Read 72.4% of 331556 rows
Read 331556 rows and 199 (of 199) columns from 0.134 GB file in 00:00:03
# number of rows and columns
dim(f03)
## [1] 331556    199

It has 199 columns; which ones are what we want? We still go back to the technical documentation, page 6-22, and read the description for file 03. The 5th field is logical record number, 6th the total population data and 8th the black population data. Each population data field also has a reference but is hard to follow. For clarity we name these column in plain English. So all the data we need in Indiana is

population <- f03[, .(LOGRECNO = V5,
                      total_popul = V6,
                      black_popul = V8)]
population
##         LOGRECNO total_popul black_popul
##      1:        1     6483802      591397
##      2:        2     4697100      580128
##      3:        3     3836584      560288
##      4:        4      860516       19840
##      5:        5     1786702       11269
##     ---                                 
## 331552:   331552       18918          24
## 331553:   331553        4636         154
## 331554:   331554       10537          25
## 331555:   331555         560         100
## 331556:   331556           0           0

From here it is easy to get the population in South Bend at census block level: just join the data to sb_block by logical record number:

sb_block_popl <- population[sb_block, on = .(LOGRECNO)]
sb_block_popl
##       LOGRECNO total_popul black_popul SUMLEV PLACE INTPTLAT  INTPTLON
##    1:   241626          28          10    100 71000 41.63613 -86.21864
##    2:   241627           0           0    100 71000 41.63670 -86.21659
##    3:   241628          52          16    100 71000 41.63573 -86.22172
##    4:   241629         279          21    100 71000 41.63182 -86.22022
##    5:   241630          42           1    100 71000 41.63367 -86.22093
##   ---                                                                 
## 5002:   252566          64           0    100 71000 41.69486 -86.25132
## 5003:   252567           0           0    100 71000 41.69649 -86.25815
## 5004:   253091           0           0    100 71000 41.73058 -86.35508
## 5005:   253092           0           0    100 71000 41.73035 -86.35565
## 5006:   253093           0           0    100 71000 41.72831 -86.35573

There is no need to keep the blocks where there are no people lives.

sb_block_popul <- sb_block_popl[total_popul > 0]
sb_block_popul
##       LOGRECNO total_popul black_popul SUMLEV PLACE INTPTLAT  INTPTLON
##    1:   241626          28          10    100 71000 41.63613 -86.21864
##    2:   241628          52          16    100 71000 41.63573 -86.22172
##    3:   241629         279          21    100 71000 41.63182 -86.22022
##    4:   241630          42           1    100 71000 41.63367 -86.22093
##    5:   241631          48           3    100 71000 41.63525 -86.21833
##   ---                                                                 
## 3780:   252543          75           6    100 71000 41.66027 -86.32578
## 3781:   252551          36           0    100 71000 41.69367 -86.24275
## 3782:   252559         360          11    100 71000 41.69560 -86.25924
## 3783:   252561          14           0    100 71000 41.69583 -86.25781
## 3784:   252566          64           0    100 71000 41.69486 -86.25132

plot on map at block level

Now we can plot total and black population on the map. This is the same map we saw at the beginning of the post. It is not shown here.

ggmap(map) +
    geom_point(data = sb_block_popul, 
               aes(INTPTLON, INTPTLAT, size = total_popul,  color = "red"), 
               alpha = 0.6) +
    # remove row with 0 black population, otherwise show a small dot
    geom_point(data = sb_block_popul[black_popul > 0], 
               aes(INTPTLON, INTPTLAT, size = black_popul, color = "blue"), 
               alpha = 0.6) +
    scale_size_area(breaks = c(1, 50, 100, 300, 600, 1000)) +
    scale_color_identity(guide = "legend", 
                         breaks = c("red", "blue"),
                         label = c("all", "black")) +
    labs(color = "race", size = "population") +
    guides(size = guide_legend(override.aes = list(shape = 1)))

More concise code

the data.table way

The above codes that select South Bend data can be squeezed with pipe operator %>% to get rid of intermediate variables.

library(data.table)
library(magrittr)

## An example to retrieve and plot the total and black population at the census
## black level of the city South Bend in Indiana

# the directory holding all census 2010 data
path_to_data <- "~/dropbox_datasets/US_2010_census/"

# geographic record of South Bend (sb) at census block level
sb_block <- fread(paste0(path_to_data, "IN/ingeo2010.ur1"), sep = "\n", 
                  header = FALSE) %>%
    .[, .(LOGRECNO = as.numeric(substr(V1, 19, 25)),
          SUMLEV = substr(V1, 9, 11),
          PLACE = substr(V1, 46, 50),
          INTPTLAT = as.numeric(substr(V1, 337, 347)),
          INTPTLON = as.numeric(substr(V1, 348, 359)))] %>%
    .[PLACE == "71000" & SUMLEV == "100"]

# total and black population of South Bend at census block level
sb_block_popul <- fread(paste0(path_to_data, "IN/in000032010.ur1"), 
                       header = FALSE) %>%
    .[, .(LOGRECNO = V5,
          total_popul = V6,
          black_popul = V8)] %>%
    .[sb_block, on = .(LOGRECNO)] %>%
    .[total_popul > 0]

the dplyr approach

We still use data.table’s fread() to read the data but use dplyr functions to process data.

library(data.table)
library(dplyr)
path_to_data <- "~/dropbox_datasets/US_2010_census/"

sb_block <- fread(paste0(path_to_data, "IN/ingeo2010.ur1"), sep = "\n", 
                  header = FALSE) %>%
    transmute(
        LOGRECNO = as.numeric(substr(V1, 19, 25)),
        SUMLEV = substr(V1, 9, 11),
        PLACE = substr(V1, 46, 50),
        INTPTLAT = as.numeric(substr(V1, 337, 347)),
        INTPTLON = as.numeric(substr(V1, 348, 359))
    ) %>%
    filter(PLACE == "71000" & SUMLEV == "100")

sb_block_popul <- fread(paste0(path_to_data, "IN/in000032010.ur1"), header = FALSE) %>%
    transmute(
        LOGRECNO = V5,
        total_popul = V6,
        black_popul = V8
    ) %>%
    right_join(sb_block, by = "LOGRECNO") %>%
    filter(total_popul > 0)

To leave a comment for the author, please follow the link and comment on their blog: R & Census.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.