Imputing missing values in parallel using {furrr}
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Today I saw this tweet on my timeline:
For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you.
— Davis Vaughan (@dvaughan32) April 13, 2018
Introducing `furrr`, parallel purrr through the use of futures. Go ahead, break things, you know you want to:https://t.co/l9z1UC2Tew
and as a heavy {purrr}
user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out {furrr}
. For those unfamiliar with {purrr}
, you can read some of my previous blog posts on it here, here or here.
To summarize very quickly: {purrr}
contains so-called higher order functions, which are functions that take other functions as argument. One such function is map()
. Consider the following simple example:
numbers <- seq(1, 10)
If you want the square root of this numbers, you can of course simply use the sqrt()
function, because it is vectorized:
sqrt(numbers) ## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 ## [8] 2.828427 3.000000 3.162278
But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if sqrt()
was not vectorized:
sqrt_numbers <- rep(0, 10) for(i in length(numbers)){ sqrt_numbers[i] <- sqrt(numbers[i]) }
First, you need to initialize a container, and then you have to populate the sqrt_numbers
list with the results. Using, {purrr}
is way easier:
library(tidyverse) map(numbers, sqrt) ## [[1]] ## [1] 1 ## ## [[2]] ## [1] 1.414214 ## ## [[3]] ## [1] 1.732051 ## ## [[4]] ## [1] 2 ## ## [[5]] ## [1] 2.236068 ## ## [[6]] ## [1] 2.44949 ## ## [[7]] ## [1] 2.645751 ## ## [[8]] ## [1] 2.828427 ## ## [[9]] ## [1] 3 ## ## [[10]] ## [1] 3.162278
map()
is only one of the nice functions that are bundled inside {purrr}
. Mastering {purrr}
can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the {mice}
package. {mice}
comes with an example dataset called boys
, let’s take a look at it:
library(mice) data(boys) brotools::describe(boys) %>% select(variable, type, n_missing, everything()) ## # A tibble: 9 x 12 ## variable type n_missing mean sd mode min max q25 median ## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 0 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 ## 2 bmi Numeric 21 18.1 3.05 <NA> 11.8 31.7 15.9 17.4 ## 3 hc Numeric 46 51.5 5.91 <NA> 33.7 65.0 48.1 53.0 ## 4 hgt Numeric 20 132 46.5 <NA> 50.0 198 84.9 147 ## 5 tv Numeric 522 11.9 7.99 <NA> 1.00 25.0 4.00 12.0 ## 6 wgt Numeric 4 37.2 26.0 <NA> 3.14 117 11.7 34.7 ## 7 gen Factor 503 NA NA <NA> NA NA NA NA ## 8 phb Factor 503 NA NA <NA> NA NA NA NA ## 9 reg Factor 3 NA NA south NA NA NA NA ## # ... with 2 more variables: q75 <dbl>, n_unique <int>
In the code above I use the describe()
function from my personal package to get some summary statistics of the boys
dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.
We see that some columns have a lot of missing values. Using the mice
function, it is very easy to impute them:
start <- Sys.time() imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE) end <- Sys.time() - start print(end) ## Time difference of 3.290611 mins
Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase maxit
, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.
The mice()
function returns a mids
object. If you want to look at the data, you have to use the complete()
function (careful, there is also a complete()
function in the {tidyr}
package, so to avoid problems, I suggest you explicitely call mice::complete()
):
imp_boys <- mice::complete(imp_boys, "long") brotools::describe(imp_boys) %>% select(variable, type, n_missing, everything()) ## # A tibble: 11 x 12 ## variable type n_missing mean sd mode min max q25 median ## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 age Numer… 0 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 ## 2 bmi Numer… 0 18.0 3.03 <NA> 11.8 31.7 15.9 17.4 ## 3 hc Numer… 0 51.6 5.89 <NA> 33.7 65.0 48.3 53.2 ## 4 hgt Numer… 0 131 46.5 <NA> 50.0 198 83.0 146 ## 5 tv Numer… 0 8.39 8.09 <NA> 1.00 25.0 2.00 3.00 ## 6 wgt Numer… 0 37.1 26.0 <NA> 3.14 117 11.7 34.6 ## 7 .id Factor 0 NA NA 3 NA NA NA NA ## 8 .imp Factor 0 NA NA 1 NA NA NA NA ## 9 gen Factor 0 NA NA G1 NA NA NA NA ## 10 phb Factor 0 NA NA P1 NA NA NA NA ## 11 reg Factor 0 NA NA south NA NA NA NA ## # ... with 2 more variables: q75 <dbl>, n_unique <int>
As expected, no more missing values. The “long” argument inside mice::complete()
is needed if you want the complete()
function to return a long dataset. Doing the above “manually” using {purrr}
is possible with the following code:
start <- Sys.time() imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)) end <- Sys.time() - start print(end) ## Time difference of 3.393966 mins
What this does is map the function ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)
to a list of 1
s, and creates 10 imputed data sets. m = .
means that m
will be equal to whatever is inside the list we are mapping our function over, so 1
, then 1
then another 1
etc…. It took around the same amount of time as using mice()
directly.
imp_boys_purrr
is now a list of 10 mids
objects. We thus need to map mice::complete()
to imp_boys_purrr
to get the data:
imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)
Now, imp_boys_purrr_complete
is a list of 10 datasets. Let’s map brotools::describe()
to it:
map(imp_boys_purrr_complete, brotools::describe) ## [[1]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.03 <NA> 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.7 5.90 <NA> 33.7 65.0 48.3 53.1 56.0 ## 4 hgt Numeric 131 46.5 <NA> 50.0 198 84.0 146 175 ## 5 tv Numeric 8.35 8.00 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.0 <NA> 3.14 117 11.7 34.7 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[2]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.03 <NA> 11.8 31.7 15.9 17.5 19.5 ## 3 hc Numeric 51.6 5.88 <NA> 33.7 65.0 48.3 53.2 56.0 ## 4 hgt Numeric 131 46.6 <NA> 50.0 198 83.5 145 175 ## 5 tv Numeric 8.37 8.02 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.0 <NA> 3.14 117 11.9 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P2 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[3]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.1 3.04 <NA> 11.8 31.7 15.9 17.5 19.5 ## 3 hc Numeric 51.6 5.87 <NA> 33.7 65.0 48.5 53.3 56.0 ## 4 hgt Numeric 131 46.6 <NA> 50.0 198 83.0 145 175 ## 5 tv Numeric 8.46 8.14 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.1 <NA> 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[4]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.1 3.02 <NA> 11.8 31.7 15.9 17.5 19.4 ## 3 hc Numeric 51.7 5.93 <NA> 33.7 65.0 48.5 53.4 56.0 ## 4 hgt Numeric 131 46.5 <NA> 50.0 198 82.9 145 175 ## 5 tv Numeric 8.45 8.11 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.0 <NA> 3.14 117 11.7 34.7 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[5]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.03 <NA> 11.8 31.7 15.9 17.5 19.5 ## 3 hc Numeric 51.6 5.91 <NA> 33.7 65.0 48.3 53.2 56.0 ## 4 hgt Numeric 131 46.6 <NA> 50.0 198 83.0 146 175 ## 5 tv Numeric 8.21 8.02 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.0 <NA> 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[6]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.05 <NA> 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.7 5.89 <NA> 33.7 65.0 48.3 53.2 56.0 ## 4 hgt Numeric 131 46.5 <NA> 50.0 198 83.0 146 175 ## 5 tv Numeric 8.44 8.24 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.0 <NA> 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[7]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.1 3.04 <NA> 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.6 5.88 <NA> 33.7 65.0 48.2 53.2 56.0 ## 4 hgt Numeric 131 46.6 <NA> 50.0 198 83.5 146 175 ## 5 tv Numeric 8.47 8.15 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.1 <NA> 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[8]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.04 <NA> 11.8 31.7 15.9 17.4 19.4 ## 3 hc Numeric 51.6 5.85 <NA> 33.7 65.0 48.2 53.3 56.0 ## 4 hgt Numeric 131 46.5 <NA> 50.0 198 83.0 146 175 ## 5 tv Numeric 8.36 8.06 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.2 26.1 <NA> 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[9]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.05 <NA> 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.6 5.90 <NA> 33.7 65.0 48.3 53.2 56.0 ## 4 hgt Numeric 131 46.6 <NA> 50.0 198 83.9 146 175 ## 5 tv Numeric 8.57 8.25 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.1 <NA> 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int> ## ## [[10]] ## # A tibble: 9 x 12 ## variable type mean sd mode min max q25 median q75 ## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 age Numeric 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 15.3 ## 2 bmi Numeric 18.0 3.04 <NA> 11.8 31.7 15.9 17.4 19.5 ## 3 hc Numeric 51.6 5.89 <NA> 33.7 65.0 48.3 53.1 56.0 ## 4 hgt Numeric 131 46.6 <NA> 50.0 198 83.0 146 175 ## 5 tv Numeric 8.49 8.18 <NA> 1.00 25.0 2.00 3.00 15.0 ## 6 wgt Numeric 37.1 26.1 <NA> 3.14 117 11.7 34.6 59.6 ## 7 gen Factor NA NA G1 NA NA NA NA NA ## 8 phb Factor NA NA P1 NA NA NA NA NA ## 9 reg Factor NA NA south NA NA NA NA NA ## # ... with 2 more variables: n_missing <int>, n_unique <int>
Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of purrr::map()
, called map2()
:
imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))
map2()
applies a function, say f()
, to 2 lists sequentially: f(x_1, y_1)
, then f(x_2, y_2)
, etc… So here I map mutate()
to create a new column, imp_id
in each dataset. Now let’s bind the rows and take a look at the data:
imp_boys_purrr <- bind_rows(imp_boys_purrr) imp_boys_purrr %>% brotools::describe() %>% select(variable, type, n_missing, everything()) ## # A tibble: 10 x 12 ## variable type n_missing mean sd mode min max q25 median ## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 age Numer… 0 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 ## 2 bmi Numer… 0 18.0 3.04 <NA> 11.8 31.7 15.9 17.4 ## 3 hc Numer… 0 51.6 5.89 <NA> 33.7 65.0 48.3 53.2 ## 4 hgt Numer… 0 131 46.5 <NA> 50.0 198 83.0 146 ## 5 tv Numer… 0 8.42 8.11 <NA> 1.00 25.0 2.00 3.00 ## 6 wgt Numer… 0 37.1 26.0 <NA> 3.14 117 11.7 34.6 ## 7 imp_id Chara… 0 NA NA 1 NA NA NA NA ## 8 gen Factor 0 NA NA G1 NA NA NA NA ## 9 phb Factor 0 NA NA P1 NA NA NA NA ## 10 reg Factor 0 NA NA south NA NA NA NA ## # ... with 2 more variables: q75 <dbl>, n_unique <int>
You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using purrr::map()
and simply use furrr::future_map()
to parallelize the imputation process:
library(furrr) ## Loading required package: future plan(multiprocess) start <- Sys.time() imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)) end <- Sys.time() - start print(end) ## Time difference of 33.73772 secs
Boooom! Much faster! And simply by loading {furrr}
, then using plan(multiprocess)
to run the code in parallel (if you forget that, the code will run on a single core) and using future_map()
instead of map()
.
Let’s take a look at the data:
imp_boys_future_complete <- map(imp_boys_future, mice::complete) imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x))) imp_boys_future <- bind_rows(imp_boys_future) imp_boys_future %>% brotools::describe() %>% select(variable, type, n_missing, everything()) ## # A tibble: 10 x 12 ## variable type n_missing mean sd mode min max q25 median ## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 age Numer… 0 9.16 6.89 <NA> 0.0350 21.2 1.58 10.5 ## 2 bmi Numer… 0 18.0 3.04 <NA> 11.8 31.7 15.9 17.4 ## 3 hc Numer… 0 51.6 5.89 <NA> 33.7 65.0 48.4 53.2 ## 4 hgt Numer… 0 131 46.5 <NA> 50.0 198 83.0 146 ## 5 tv Numer… 0 8.35 8.09 <NA> 1.00 25.0 2.00 3.00 ## 6 wgt Numer… 0 37.1 26.0 <NA> 3.14 117 11.7 34.6 ## 7 imp_id Chara… 0 NA NA 1 NA NA NA NA ## 8 gen Factor 0 NA NA G1 NA NA NA NA ## 9 phb Factor 0 NA NA P1 NA NA NA NA ## 10 reg Factor 0 NA NA south NA NA NA NA ## # ... with 2 more variables: q75 <dbl>, n_unique <int>
So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with {furrr}
you must install it from Github, as it is not yet available on CRAN:
devtools::install_github("DavisVaughan/furrr")
If you are not comfortable with map()
(and thus future_map()
) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it with like so:
devtools::install_github("b-rodrigues/parlMICE")
If you found this blog post useful, you might want to follow me on twitter for blog post updates.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.