
Reading Remote Data Files

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers.]

Sometimes data arrives as a series of individual files, each of which is organized in the same way—which is to say, each of which has the same variables, features, or columns. Imagine a series of tables reporting mandated information about every school in the state, or a hundred spreadsheets each with information about a different country, or thirty-seven CSVs each with the same columns of information about representatives in each U.S. Congressional Session since 1945. Because these data were not provided to us as a single big table, we generally want to make them into one. If the files are CSVs on our local computer, R has some nice functions that allow us to iterate over a vector of filenames and produce a tidy table of data.

Case 1: Lots of local CSV files

For example, our Congressional project might have a data folder with a subfolder called congress. We can get a listing of the CSV files inside it like this:

r
## Assumes the tidyverse and the here package are loaded
filenames <- dir(path = here("files", "data", "congress"),
                 pattern = "\\.csv$", # dir() takes a regex, not a glob
                 full.names = TRUE)

filenames[1:10] # Just displaying the first 10

##  [1] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/01_79_congress.csv"
##  [2] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/02_80_congress.csv"
##  [3] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/03_81_congress.csv"
##  [4] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/04_82_congress.csv"
##  [5] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/05_83_congress.csv"
##  [6] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/06_84_congress.csv"
##  [7] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/07_85_congress.csv"
##  [8] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/08_86_congress.csv"
##  [9] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/09_87_congress.csv"
## [10] "/Users/kjhealy/Documents/courses/vsd/files/data/congress/10_88_congress.csv"

We can feed that vector to read_csv() and it will be quite happy. It reads each file and binds them all together by row, in a big stack. (Again, remember we know ex ante that the files all have the same column structure. We’ll get errors or warnings if this isn’t true.)

r
df <- read_csv(filenames, id = "path",
                name_repair = janitor::make_clean_names)

df |> 
  mutate(congress = str_extract(path, "_\\d{2,3}_congress"), 
         congress = str_extract(congress, "\\d{2,3}")) |> 
  relocate(congress)

## # A tibble: 20,580 × 27
##    congress path   last  first middle suffix nickname born  death sex   position
##    <chr>    <chr>  <chr> <chr> <chr>  <chr>  <chr>    <chr> <chr> <chr> <chr>   
##  1 79       /User… Aber… Thom… Gerst… <NA>   <NA>     05/1… 01/2… M     U.S. Re…
##  2 79       /User… Adams Sher… <NA>   <NA>   <NA>     01/0… 10/2… M     U.S. Re…
##  3 79       /User… Aiken Geor… David  <NA>   <NA>     08/2… 11/1… M     U.S. Se…
##  4 79       /User… Allen Asa   Leona… <NA>   <NA>     01/0… 01/0… M     U.S. Re…
##  5 79       /User… Allen Leo   Elwood <NA>   <NA>     10/0… 01/1… M     U.S. Re…
##  6 79       /User… Almo… J.    Linds… Jr.    <NA>     06/1… 04/1… M     U.S. Re…
##  7 79       /User… Ande… Herm… Carl   <NA>   <NA>     01/2… 07/2… M     U.S. Re…
##  8 79       /User… Ande… Clin… Presba <NA>   <NA>     10/2… 11/1… M     U.S. Re…
##  9 79       /User… Ande… John  Zuing… <NA>   <NA>     03/2… 02/0… M     U.S. Re…
## 10 79       /User… Andr… Augu… Herman <NA>   <NA>     10/1… 01/1… M     U.S. Re…
## # ℹ 20,570 more rows
## # ℹ 16 more variables: party <chr>, state <chr>, district <chr>, start <chr>,
## #   end <chr>, religion <chr>, race <chr>, educational_attainment <chr>,
## #   job_type1 <chr>, job_type2 <chr>, job_type3 <chr>, job_type4 <chr>,
## #   job_type5 <chr>, mil1 <chr>, mil2 <chr>, mil3 <chr>

You can see how path is created as an id column, to help us keep track of which file each row of data came from. We create the congress column after reading the data in, by extracting the congressional session from the filename with a regular expression. It's very convenient to be able to read a whole set of files in like this.
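As an aside, the two str_extract() calls could be collapsed into a single pattern using lookarounds, which stringr's ICU regex engine supports. A minimal sketch of the same extraction, not the original code:

r
## One-step version: grab 2-3 digits preceded by "_" and followed by "_congress"
df |>
  mutate(congress = str_extract(path, "(?<=_)\\d{2,3}(?=_congress)")) |>
  relocate(congress)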

Case 2: Lots of local Excel files

What if the data is in a file format whose read-in function doesn't know this trick of accepting a vector of file paths? In that case we can do what read_csv() is doing behind the scenes: map the read-in function over the vector of file names and explicitly bind the results together. The default map() function returns a list of whatever you did. But there are type-specific mappers, too. We are dealing with data frames—with regular tables of data—so we have a function, map_dfr(), that takes input and binds its output by row into a data frame (or dies trying). So, for example, if our Congressional data were all Excel files in .xlsx format, and they all had the same structure with a header row of column names, we could write this:

r
df <- filenames |>
  map_dfr(~ readxl::read_xlsx(.x),
          .id = "path")

And we would get (almost) the same result. Here we start with the vector of filenames and pass it down a pipe to map_dfr(), which goes ahead and maps, or applies, the read_xlsx() function to each element of the filenames vector—i.e. to each file. The .x there is a placeholder or pronoun that means ‘whatever one we’re working on right now’. Mapping functions like this is just a kind of iteration where you don’t have to explicitly write a loop. This makes it easier to cleanly compose sequences or chains of functions without having to explicitly create a bunch of placeholder objects or declare counters and so on.
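If it helps to see what the mapping replaces, here is roughly the same operation written as an explicit loop. This is an illustrative sketch only; dplyr::bind_rows() does the row-binding that map_dfr() handles for us:

r
## Roughly what map_dfr() is doing, spelled out as a loop (illustration only)
out <- vector(mode = "list", length = length(filenames))
for (i in seq_along(filenames)) {
  out[[i]] <- readxl::read_xlsx(filenames[i])
}
df <- dplyr::bind_rows(out, .id = "path")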

Case 3: Lots of remote CSV or Excel files in a bare directory

Now, what if the files we want are stored remotely on a server? These days there’s often an API for such things. But quite often, even now, you may find yourself dealing (as I did yesterday) with a bare directory of files that looks like this:

A listing of state-level life-tables from the National Center for Health Statistics

This is a directory of state-by-state life tables associated with a CDC report. Again, we want them (or, as we’ll see, some of them) as a single table.

Now, if these were provided as CSVs our task would be a little easier, because in addition to being able to deal with a vector of filenames at once, read_csv(), and indeed all the read-in functions in readr, will happily read URLs as well as local file paths. However, the read_xlsx() function in the readxl package can't do this yet. It only wants file paths. A second issue is that the Excel files themselves are not entirely tidy. At the top they look like this:

The top of a state-level life table Excel file.

Those first two rows are a mild violation of one of the rules of thumb for entering data in spreadsheets, helpfully outlined by Karl Broman and Kara Woo. The first row is metadata; the second is a more verbose description of the standard lifetable headers in the third row. Except for Age, which is not labeled in the third row. That big box labeled "Age (years)" is actually a super-sized second-row cell. That means the first element of row three, our actual column headers, is blank! This is annoying. Fortunately these problems are easily dealt with, as we can just tell our read function to skip those first two lines. There's also a Source row at the bottom (not shown here) that we'll have to strip out.

But the first order of business is getting a vector of the actual file URLs to download. You could just copy and paste the listing, like an animal, but we are not going to do that. Instead, we’ll take advantage of the old-school empty-directory listing to get the file names. We’ll do this using R’s implementation of curl.

r
## The directory location
url_loc <- "ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/NVSR/71-02/"

## Prepare the connection
list_files <- curl::new_handle()
curl::handle_setopt(list_files, ftp_use_epsv = TRUE, dirlistonly = TRUE)

## Open the connection as an FTP listing
con <- curl::curl(url = paste0("ftp://", url_loc), open = "r", handle = list_files)

## Get the file names
files <- readLines(con)

## Close the connection 
close(con)

files[1:10] # Just show the first ten

## [1] "AK1.xlsx" "AK2.xlsx" "AK3.xlsx" "AK4.xlsx" "AL1.xlsx" "AL2.xlsx" "AL3.xlsx" "AL4.xlsx" "AR1.xlsx" "AR2.xlsx"

We open a connection to the remote folder and set the ftp_use_epsv and dirlistonly options to restrict what we get back. Then we read the lines received from the FTP server into the files object. This gives us the bare file names of everything in this remote folder.
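Incidentally, you can fetch the same listing in a single call with curl_fetch_memory() instead of opening and closing a connection by hand. A sketch, reusing the handle from above:

r
## Alternative: pull the whole listing into memory, then split it into lines
res <- curl::curl_fetch_memory(paste0("ftp://", url_loc), handle = list_files)
files <- strsplit(rawToChar(res$content), "\r?\n")[[1]]
files <- files[files != ""] # drop any trailing empty element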

The spreadsheets are named according to a scheme with a two-letter state abbreviation followed by a number. The number signifies the kind of life-table it is. The files ending in 1 have the life-table for the population as a whole, which is what we are interested in.

Now that we have a vector of the file names (but just the file names at this point) we can do a bit of prep:

r
## Just take files containing a `1`
fname_stubs <- files[str_detect(files, "1")]

## State abbreviations
fname_labs <- substr(fname_stubs, start = 1, stop = 2)

## Construct the filenames and give them a name attribute of the 2-letter state
## abbreviation, so they are trackable in the data frame we're about to make.
fnames <- paste0("https://", url_loc, fname_stubs)
fnames <- set_names(fnames, fname_labs)


fnames[1:4]

##                                                                               AK 
## "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/NVSR/71-02/AK1.xlsx" 
##                                                                               AL 
## "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/NVSR/71-02/AL1.xlsx" 
##                                                                               AR 
## "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/NVSR/71-02/AR1.xlsx" 
##                                                                               AZ 
## "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/NVSR/71-02/AZ1.xlsx" 

You can see that the vector elements (the actual URLs) also have a name or label (the state abbreviation). This will make it easier to create a state id column, because map_dfr() will use the label as its index counter.

The last step is to get read_xlsx() to read all the remote files, which it does not have the capacity to do directly. It won't even accept a single URL; it only wants file paths. So we will have to write a one-off function that fetches the file and puts it in a temporary location that read_xlsx() can see.

r
# Feed this an http URL
get_lifetable <- function(x) {
  httr::GET(x, httr::write_disk(tf <- tempfile(fileext = ".xlsx")))
  readxl::read_xlsx(tf, skip = 2, .name_repair = "unique_quiet") |>
    rename("age" = `...1`) |>
    filter(!str_detect(age, "SOURCE")) # remove trailing source line
}

The first line inside the function uses httr to GET the file and immediately save it locally with write_disk(), taking care to specify that the temporary file we save should have an .xlsx extension. (Otherwise read_xlsx() will complain.) The second line actually reads in the file that's been downloaded. We take the opportunity to suppress chatter about the name repair that has to happen on that blank first column header in the third row, rename that column age, and strip the trailing line about the source of the data that I mentioned above.
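Before mapping over all fifty-odd URLs, it's worth test-driving the function on a single one. A quick sanity check, using the named fnames vector from above:

r
## Try one state's file first
ak <- get_lifetable(fnames["AK"])
head(ak, n = 3)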

This function reads one given URL. Now we just need to map a vector of URLs to it and bind the results by row:

r
life_tabs <- fnames |>
  map_dfr(~ get_lifetable(.x),
          .id = "state")

As I said, a nice thing is that map_dfr() will use the name attribute of fnames to create its id column, which we can therefore name state, because the name of each URL element is the abbreviation of the state it provides data about.
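As an aside, map_dfr() is superseded in purrr 1.0.0 and later. If you are on a recent version of purrr, the equivalent idiom is map() followed by list_rbind(), which likewise picks up the names of fnames:

r
## Equivalent in purrr >= 1.0.0
life_tabs <- fnames |>
  map(get_lifetable) |>
  list_rbind(names_to = "state")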

And we’re done:

r
life_tabs

## # A tibble: 5,151 × 8
##    state age          qx      lx     dx     Lx       Tx    ex
##    <chr> <chr>     <dbl>   <dbl>  <dbl>  <dbl>    <dbl> <dbl>
##  1 AK    0–1   0.00513   100000  513.   99623. 7661992   76.6
##  2 AK    1–2   0.000308   99487.  30.6  99472. 7562368.  76.0
##  3 AK    2–3   0.0000992  99457.   9.87 99452. 7462896.  75.0
##  4 AK    3–4   0.000194   99447.  19.3  99437. 7363444.  74.0
##  5 AK    4–5   0.000192   99428.  19.1  99418. 7264007   73.1
##  6 AK    5–6   0.000288   99408.  28.6  99394. 7164589   72.1
##  7 AK    6–7   0.000328   99380.  32.6  99364. 7065195   71.1
##  8 AK    7–8   0.000348   99347.  34.6  99330. 6965832.  70.1
##  9 AK    8–9   0.000349   99313.  34.6  99295. 6866502.  69.1
## 10 AK    9–10  0.000333   99278.  33.1  99261. 6767206   68.2
## # ℹ 5,141 more rows
## # ℹ Use `print(n = ...)` to see more rows

A few thousand rows of data programmatically extracted from fifty-one spreadsheets, now ready for a bit more cleaning and any amount of confusion about what it is that life-tables actually show.
