Use quick formula functions in purrr::map (+ base vs tidyverse idiom comparisons/examples)
I’ve converted the vast majority of my *apply usage over to purrr functions. In an attempt to make this a quick post, I’ll refrain from going into all the benefits of the purrr package. Instead, I’ll show just one thing that’s super helpful: formula functions.
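As a quick taste before the real example (a toy sketch, not the post’s data): the formula is just shorthand for an anonymous function.

```r
library(purrr)

# These two calls are equivalent; `~` builds the function
# and `.x` (or `.`) stands in for the current element
map(1:3, function(x) x^2)
map(1:3, ~.x^2)
```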
After seeing this Quartz article using a visualization to compare the frequency and volume of mass shootings, I wanted to grab the data to look at it with a stats-eye (humans are ++gd at visually identifying patterns, but we’re also ++gd at misinterpreting them, plus stats validates visual assumptions). I’m not going into that here, but will use the grabbing of the data to illustrate the formula functions. Note that there’s quite a bit of “setup” here for just one example, so I guess I kinda am attempting to shill the purrr package and the “piping tidyverse” just a tad.
If you head on over to the site with the data you’ll see you can download files for all four years. In theory, these are all individual years, but the names of the files gave me pause:
```
MST Data 2013 - 2015.csv
MST Data 2014 - 2015.csv
MST Data 2015 - 2015.csv
Mass Shooting Data 2016 - 2016.csv
```
So, they may all be individual years, but the naming consistency isn’t there and it’s better to double check than to assume.
First, we can check to see if the column names are the same (we can eyeball this since there are only four files and a small # of columns):
```r
library(purrr)
library(readr)

list.files() %>%
  map(read_csv) %>%
  map(colnames)
## [[1]]
## [1] "date"                        "name_semicolon_delimited"
## [3] "killed"                      "wounded"
## [5] "city"                        "state"
## [7] "sources_semicolon_delimited"
##
## [[2]]
## [1] "date"                        "name_semicolon_delimited"
## [3] "killed"                      "wounded"
## [5] "city"                        "state"
## [7] "sources_semicolon_delimited"
##
## [[3]]
## [1] "date"                        "name_semicolon_delimited"
## [3] "killed"                      "wounded"
## [5] "city"                        "state"
## [7] "sources_semicolon_delimited"
##
## [[4]]
## [1] "date"                        "name_semicolon_delimited"
## [3] "killed"                      "wounded"
## [5] "city"                        "state"
## [7] "sources_semicolon_delimited"
```
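Eyeballing works fine for four files, but with dozens a programmatic check would be safer. A sketch of one way (not in the original post): base `unique()` works on lists, so identical column-name vectors collapse to a single element.

```r
# TRUE if every file has exactly the same column names
cols <- list.files() %>%
  map(read_csv) %>%
  map(colnames)

length(unique(cols)) == 1
```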
A quick inspection of the `date` column shows it’s in `month/day/year` format and we want to know if each file only spans one year. This is where the elegance of the formula function comes in:
```r
library(lubridate)

list.files() %>%
  map(read_csv) %>%
  map(~range(mdy(.$date))) # <--- the *entire* post was to show this one line ;-)
## [[1]]
## [1] "2016-01-06" "2016-07-25"
##
## [[2]]
## [1] "2013-01-01" "2013-12-31"
##
## [[3]]
## [1] "2014-01-01" "2014-12-29"
##
## [[4]]
## [1] "2015-01-01" "2015-12-31"
```
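One optional refinement (my addition, not in the original post): `set_names()` with no other arguments names a character vector after itself, so each date range in the output carries its source filename. The `pattern` argument is an assumption; adjust it for your directory.

```r
list.files(pattern = "\\.csv$") %>%
  set_names() %>%                 # name each list element after its filename
  map(read_csv) %>%
  map(~range(mdy(.$date)))
```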
To break that down a bit:
- `list.files()` returns a vector of filenames in the current directory
- the first `map()` reads each of those files in and creates a list with four elements, each being a `tibble` (`data_frame`/`data.frame`)
- the second `map()` iterates over those data frames and calls a newly created anonymous function which converts the `date` column to a proper `Date` data type, then gets the range of those dates, ultimately resulting in a four-element list, with each element being a two-element vector of `Date`s (there’s a quick peek after this list at the function purrr builds from the formula)
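If you’re curious what purrr actually builds from that formula, `as_mapper()` (it was `as_function()` in older purrr releases) will show you the generated function:

```r
# Inspect the anonymous function purrr creates from the formula
as_mapper(~range(mdy(.$date)))
```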
For you “basers” out there, this is what that looks like old school style:
```r
fils <- list.files()
dfs <- lapply(fils, read.csv, stringsAsFactors=FALSE)
lapply(dfs, function(x) range(as.Date(x$date, format="%m/%e/%Y")))
```
or
```r
lapply(list.files(), function(x) {
  df <- read.csv(x, stringsAsFactors=FALSE)
  range(as.Date(df$date, format="%m/%e/%Y"))
})
```
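Since the title promises base-vs-tidyverse comparisons, it’s worth noting that base’s closest analogue to purrr’s typed `map_*()` variants is `vapply()`, which enforces an output template. A small sketch, reusing the `dfs` list from above:

```r
# vapply() fails loudly if any result doesn't match the integer(1) template
vapply(dfs, nrow, FUN.VALUE = integer(1))
```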
You eliminate the `function(x) { }` and get pre-defined vars (either `.x` or `.` and, if needed, `.y`) to compose your `map`s and pipes very cleanly and succinctly, while still being super-readable.
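For completeness, `.y` comes into play with the two-argument mappers; a toy illustration (not from the post’s data):

```r
# .x takes elements from the first vector, .y from the second
map2_chr(c("a", "b"), c(1, 2), ~paste0(.x, .y))
## [1] "a1" "b2"
```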
After performing this inspection (i.e. verifying that each file does contain only incidents for a single year), we can now automate the data ingestion:
```r
library(rvest)
library(purrr)
library(readr)
library(dplyr)
library(lubridate)

read_html("https://www.massshootingtracker.org/data") %>%
  html_nodes("a[href^='https://docs.goo']") %>%
  html_attr("href") %>%
  map_df(read_csv) %>%
  mutate(date=mdy(date)) -> shootings
```
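A quick sanity check on the combined result (my addition, not part of the original flow): counting incidents per year confirms all four files made it in.

```r
# one row per year with the number of recorded incidents
count(shootings, year = year(date))
```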
Here’s what that looks like w/o the tidyverse/piping:
```r
library(XML)

# note the necessary downgrade to "http"
doc <- htmlParse("http://www.massshootingtracker.org/data")

dfs <- xpathApply(doc, "//a[contains(@href, 'https://docs.goo')]", function(x) {
  csv <- xmlGetAttr(x, "href")
  df <- read.csv(csv, stringsAsFactors=FALSE)
  df$date <- as.Date(df$date, format="%m/%e/%Y")
  df
})

shootings <- do.call(rbind, dfs)
```
Even hardcore “basers” may have to admit that the piping/tidyverse version is ultimately better.
Give the purrr package a first (or second) look if you haven’t switched over to it. Type safety, readable anonymous functions and C-backed fast functional idioms will mean that your code may ultimately be purrrfect.
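On the type-safety point: `map()` always returns a list, while the typed variants guarantee the output type or fail fast rather than silently coercing. A quick toy sketch:

```r
library(purrr)

map(1:3, ~.x * 2)      # always a list
map_dbl(1:3, ~.x * 2)  # always a double vector: 2 4 6
# map_chr(1:3, ~.x * 2) # complains instead of silently coercing
```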
UPDATE
I received a question in the comments regarding how I came to that CSS selector for the gdocs CSV URLs, so I made a quick video of the exact steps I took. Exposition below the film.
Right-click “Inspect” in Chrome is my go-to method for finding what I’m after. This isn’t the place to dive deep into the dark art of web page spelunking, but in this case, when I saw there were four similar anchor (`<a>`) tags that pointed to the CSV “files”, I took the easy way out and just built a selector based on the `href` attribute value (or, more specifically, the characters at the start of the `href` attribute). However, all four ways below end up targeting the same four elements:
```r
pg <- read_html("https://www.massshootingtracker.org/data")

html_nodes(pg, "a.btn.btn-default")
html_nodes(pg, "a[href^='https://docs.goo']")
html_nodes(pg, xpath=".//a[@class='btn btn-default']")
html_nodes(pg, xpath=".//a[contains(@href, 'https://docs.goo')]")
```