Site icon R-bloggers

Imputing missing values in parallel using {furrr}

[This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today I saw this tweet on my timeline:

For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you.

Introducing `furrr`, parallel purrr through the use of futures. Go ahead, break things, you know you want to:https://t.co/l9z1UC2Tew

— Davis Vaughan (@dvaughan32) April 13, 2018

and as a heavy {purrr} user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out {furrr}. For those unfamiliar with {purrr}, you can read some of my previous blog posts on it here, here or here.

To summarize very quickly: {purrr} contains so-called higher order functions, which are functions that take other functions as argument. One such function is map(). Consider the following simple example:

numbers <- seq(1, 10)

If you want the square root of this numbers, you can of course simply use the sqrt() function, because it is vectorized:

sqrt(numbers)
##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278

But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if sqrt() was not vectorized:

sqrt_numbers <- rep(0, 10)

for(i in length(numbers)){
  sqrt_numbers[i] <- sqrt(numbers[i])
}

First, you need to initialize a container, and then you have to populate the sqrt_numbers list with the results. Using, {purrr} is way easier:

library(tidyverse)
map(numbers, sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
## 
## [[6]]
## [1] 2.44949
## 
## [[7]]
## [1] 2.645751
## 
## [[8]]
## [1] 2.828427
## 
## [[9]]
## [1] 3
## 
## [[10]]
## [1] 3.162278

map() is only one of the nice functions that are bundled inside {purrr}. Mastering {purrr} can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the {mice} package. {mice} comes with an example dataset called boys, let’s take a look at it:

library(mice)

data(boys)

brotools::describe(boys) %>%
  select(variable, type, n_missing, everything())
## # A tibble: 9 x 12
##   variable type    n_missing   mean    sd mode      min   max   q25 median
##   <chr>    <chr>       <int>  <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl>
## 1 age      Numeric         0   9.16  6.89 <NA>   0.0350  21.2  1.58   10.5
## 2 bmi      Numeric        21  18.1   3.05 <NA>  11.8     31.7 15.9    17.4
## 3 hc       Numeric        46  51.5   5.91 <NA>  33.7     65.0 48.1    53.0
## 4 hgt      Numeric        20 132    46.5  <NA>  50.0    198   84.9   147  
## 5 tv       Numeric       522  11.9   7.99 <NA>   1.00    25.0  4.00   12.0
## 6 wgt      Numeric         4  37.2  26.0  <NA>   3.14   117   11.7    34.7
## 7 gen      Factor        503  NA    NA    <NA>  NA       NA   NA      NA  
## 8 phb      Factor        503  NA    NA    <NA>  NA       NA   NA      NA  
## 9 reg      Factor          3  NA    NA    south NA       NA   NA      NA  
## # ... with 2 more variables: q75 <dbl>, n_unique <int>

In the code above I use the describe() function from my personal package to get some summary statistics of the boys dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.

We see that some columns have a lot of missing values. Using the mice function, it is very easy to impute them:

start <- Sys.time()
imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE)
end <- Sys.time() - start

print(end)
## Time difference of 3.290611 mins

Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase maxit, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.

The mice() function returns a mids object. If you want to look at the data, you have to use the complete() function (careful, there is also a complete() function in the {tidyr} package, so to avoid problems, I suggest you explicitely call mice::complete()):

imp_boys <- mice::complete(imp_boys, "long")

brotools::describe(imp_boys) %>%
  select(variable, type, n_missing, everything())
## # A tibble: 11 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##    <chr>    <chr>      <int>  <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl>
##  1 age      Numer…         0   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5 
##  2 bmi      Numer…         0  18.0   3.03 <NA>  11.8     31.7 15.9   17.4 
##  3 hc       Numer…         0  51.6   5.89 <NA>  33.7     65.0 48.3   53.2 
##  4 hgt      Numer…         0 131    46.5  <NA>  50.0    198   83.0  146   
##  5 tv       Numer…         0   8.39  8.09 <NA>   1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0  <NA>   3.14   117   11.7   34.6 
##  7 .id      Factor         0  NA    NA    3     NA       NA   NA     NA   
##  8 .imp     Factor         0  NA    NA    1     NA       NA   NA     NA   
##  9 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA   
## 10 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA   
## 11 reg      Factor         0  NA    NA    south NA       NA   NA     NA   
## # ... with 2 more variables: q75 <dbl>, n_unique <int>

As expected, no more missing values. The “long” argument inside mice::complete() is needed if you want the complete() function to return a long dataset. Doing the above “manually” using {purrr} is possible with the following code:

start <- Sys.time()
imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)
## Time difference of 3.393966 mins

What this does is map the function ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE) to a list of 1s, and creates 10 imputed data sets. m = . means that m will be equal to whatever is inside the list we are mapping our function over, so 1, then 1 then another 1 etc…. It took around the same amount of time as using mice() directly.

imp_boys_purrr is now a list of 10 mids objects. We thus need to map mice::complete() to imp_boys_purrr to get the data:

imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)

Now, imp_boys_purrr_complete is a list of 10 datasets. Let’s map brotools::describe() to it:

map(imp_boys_purrr_complete, brotools::describe)
## [[1]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03 <NA>  11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.7   5.90 <NA>  33.7     65.0 48.3   53.1   56.0
## 4 hgt      Numeric 131    46.5  <NA>  50.0    198   84.0  146    175  
## 5 tv       Numeric   8.35  8.00 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.0  <NA>   3.14   117   11.7   34.7   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[2]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03 <NA>  11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.88 <NA>  33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6  <NA>  50.0    198   83.5  145    175  
## 5 tv       Numeric   8.37  8.02 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0  <NA>   3.14   117   11.9   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P2    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[3]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.04 <NA>  11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.87 <NA>  33.7     65.0 48.5   53.3   56.0
## 4 hgt      Numeric 131    46.6  <NA>  50.0    198   83.0  145    175  
## 5 tv       Numeric   8.46  8.14 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1  <NA>   3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[4]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.02 <NA>  11.8     31.7 15.9   17.5   19.4
## 3 hc       Numeric  51.7   5.93 <NA>  33.7     65.0 48.5   53.4   56.0
## 4 hgt      Numeric 131    46.5  <NA>  50.0    198   82.9  145    175  
## 5 tv       Numeric   8.45  8.11 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.0  <NA>   3.14   117   11.7   34.7   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[5]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.03 <NA>  11.8     31.7 15.9   17.5   19.5
## 3 hc       Numeric  51.6   5.91 <NA>  33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6  <NA>  50.0    198   83.0  146    175  
## 5 tv       Numeric   8.21  8.02 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0  <NA>   3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[6]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.05 <NA>  11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.7   5.89 <NA>  33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.5  <NA>  50.0    198   83.0  146    175  
## 5 tv       Numeric   8.44  8.24 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.0  <NA>   3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[7]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.1   3.04 <NA>  11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.88 <NA>  33.7     65.0 48.2   53.2   56.0
## 4 hgt      Numeric 131    46.6  <NA>  50.0    198   83.5  146    175  
## 5 tv       Numeric   8.47  8.15 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1  <NA>   3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[8]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.04 <NA>  11.8     31.7 15.9   17.4   19.4
## 3 hc       Numeric  51.6   5.85 <NA>  33.7     65.0 48.2   53.3   56.0
## 4 hgt      Numeric 131    46.5  <NA>  50.0    198   83.0  146    175  
## 5 tv       Numeric   8.36  8.06 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.2  26.1  <NA>   3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[9]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.05 <NA>  11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.90 <NA>  33.7     65.0 48.3   53.2   56.0
## 4 hgt      Numeric 131    46.6  <NA>  50.0    198   83.9  146    175  
## 5 tv       Numeric   8.57  8.25 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.1  <NA>   3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[10]]
## # A tibble: 9 x 12
##   variable type      mean    sd mode      min   max   q25 median   q75
##   <chr>    <chr>    <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numeric   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5   15.3
## 2 bmi      Numeric  18.0   3.04 <NA>  11.8     31.7 15.9   17.4   19.5
## 3 hc       Numeric  51.6   5.89 <NA>  33.7     65.0 48.3   53.1   56.0
## 4 hgt      Numeric 131    46.6  <NA>  50.0    198   83.0  146    175  
## 5 tv       Numeric   8.49  8.18 <NA>   1.00    25.0  2.00   3.00  15.0
## 6 wgt      Numeric  37.1  26.1  <NA>   3.14   117   11.7   34.6   59.6
## 7 gen      Factor   NA    NA    G1    NA       NA   NA     NA     NA  
## 8 phb      Factor   NA    NA    P1    NA       NA   NA     NA     NA  
## 9 reg      Factor   NA    NA    south NA       NA   NA     NA     NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>

Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of purrr::map(), called map2():

imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))

map2() applies a function, say f(), to 2 lists sequentially: f(x_1, y_1), then f(x_2, y_2), etc… So here I map mutate() to create a new column, imp_id in each dataset. Now let’s bind the rows and take a look at the data:

imp_boys_purrr <- bind_rows(imp_boys_purrr)

imp_boys_purrr %>%
  brotools::describe() %>%
  select(variable, type, n_missing, everything())
## # A tibble: 10 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##    <chr>    <chr>      <int>  <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl>
##  1 age      Numer…         0   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5 
##  2 bmi      Numer…         0  18.0   3.04 <NA>  11.8     31.7 15.9   17.4 
##  3 hc       Numer…         0  51.6   5.89 <NA>  33.7     65.0 48.3   53.2 
##  4 hgt      Numer…         0 131    46.5  <NA>  50.0    198   83.0  146   
##  5 tv       Numer…         0   8.42  8.11 <NA>   1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0  <NA>   3.14   117   11.7   34.6 
##  7 imp_id   Chara…         0  NA    NA    1     NA       NA   NA     NA   
##  8 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA   
##  9 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA   
## 10 reg      Factor         0  NA    NA    south NA       NA   NA     NA   
## # ... with 2 more variables: q75 <dbl>, n_unique <int>

You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using purrr::map() and simply use furrr::future_map() to parallelize the imputation process:

library(furrr)
## Loading required package: future
plan(multiprocess)

start <- Sys.time()
imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)
## Time difference of 33.73772 secs

Boooom! Much faster! And simply by loading {furrr}, then using plan(multiprocess) to run the code in parallel (if you forget that, the code will run on a single core) and using future_map() instead of map().

Let’s take a look at the data:

imp_boys_future_complete <- map(imp_boys_future, mice::complete)

imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x)))

imp_boys_future <- bind_rows(imp_boys_future)

imp_boys_future %>%
  brotools::describe() %>%
  select(variable, type, n_missing, everything())
## # A tibble: 10 x 12
##    variable type   n_missing   mean    sd mode      min   max   q25 median
##    <chr>    <chr>      <int>  <dbl> <dbl> <chr>   <dbl> <dbl> <dbl>  <dbl>
##  1 age      Numer…         0   9.16  6.89 <NA>   0.0350  21.2  1.58  10.5 
##  2 bmi      Numer…         0  18.0   3.04 <NA>  11.8     31.7 15.9   17.4 
##  3 hc       Numer…         0  51.6   5.89 <NA>  33.7     65.0 48.4   53.2 
##  4 hgt      Numer…         0 131    46.5  <NA>  50.0    198   83.0  146   
##  5 tv       Numer…         0   8.35  8.09 <NA>   1.00    25.0  2.00   3.00
##  6 wgt      Numer…         0  37.1  26.0  <NA>   3.14   117   11.7   34.6 
##  7 imp_id   Chara…         0  NA    NA    1     NA       NA   NA     NA   
##  8 gen      Factor         0  NA    NA    G1    NA       NA   NA     NA   
##  9 phb      Factor         0  NA    NA    P1    NA       NA   NA     NA   
## 10 reg      Factor         0  NA    NA    south NA       NA   NA     NA   
## # ... with 2 more variables: q75 <dbl>, n_unique <int>

So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with {furrr} you must install it from Github, as it is not yet available on CRAN:

devtools::install_github("DavisVaughan/furrr")

If you are not comfortable with map() (and thus future_map()) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it with like so:

devtools::install_github("b-rodrigues/parlMICE")

If you found this blog post useful, you might want to follow me on twitter for blog post updates.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.