Site icon R-bloggers

Data Visualization of the WBL Index and Modeling with Quantile Regression using Random Forest

[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

As we enter the new year, women still seem not to have equal rights compared to men in the working environment. This situation is more prominent in the developing and least developed countries. This article will examine that using the WBL (women, business, and law) index.

First, we will create a plot comparing WBL scores by region, excluding high-income economies.

library(tidyverse)
library(openxlsx)
library(ragg) #google  setting

df_wbl <- 
  openxlsx::read.xlsx("https://github.com/mesdi/blog/raw/main/wbl.xlsx",
                      sheet = "WBL Panel 2023") %>% 
  as_tibble() %>% 
  janitor::clean_names() 

#Plotting compared WBL index scores by region
#Excluded high-income OECD countries
df_wbl %>% 
  #remove the high income level countries
  filter(region != "High income: OECD") %>% 
  group_by(region) %>% 
  mutate(wbl_mean_index = mean(wbl_index)) %>% 
  select(wbl_mean_index) %>% 
  unique() %>% 
  ggplot(aes(wbl_mean_index, 
             reorder(region, wbl_mean_index), 
             fill = region)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = .data$wbl_mean_index %>% round(2)),
            nudge_x = 2,
            family = "Bricolage Grotesque") +
  scale_x_continuous(position = "top") +
  labs(x = "", 
       y = "",
       title = "Comparing WBL Index Averages by Region, 1971-2023",
       subtitle = "Excluding high-income economies",
       #for two captions
       caption = c("Source: Women, Business, and the Law database" ,
                   "WBL: Women, Business, and the Law")) +
  theme_minimal(base_family = "Bricolage Grotesque",
                base_size = 16) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(linewidth = 1.1),
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 11),
        plot.caption = element_text(hjust = c(0.06,1), #for two captions layout
                                    size = 10),
        legend.position = "none",
        plot.background = element_rect(fill = "#F0F0F0"))

It seems that Sub-Saharan Africa is making progress and has the potential to close the gap between itself and East Asia and the Pacific region. On the other hand, Latin America shows that they are not behind Europe at all.

Now, we are examining the reasons beyond the above results. To do this, we will use quantile regression via a random forest model.

#Modeling
library(tidymodels)
library(gt)

#Splitting into train and test sets
set.seed(1983)
data_split <- initial_split(df_wbl, 
                            strata = "wbl_index", 
                            prop = 0.8)

wbl_train <- training(data_split)
wbl_test  <- testing(data_split) 

#Recipe
wbl_rec <- 
  recipe(
    wbl_index ~ 
      region + 
      income_group  + 
      length_of_paid_paternity_leave + 
      length_of_paid_maternity_leave, 
    data = wbl_train
  )

#Quantile regression via random forest from ranger package
wbl_mod <- 
  rand_forest() %>%
  set_engine("ranger", 
             importance = "permutation", # for variable importance, 
             seed = 12345, 
             quantreg = TRUE # for quantile regression ) %>%
  set_mode("regression")

set.seed(98765)
wbl_wflw <- 
  workflow() %>% 
  add_model(wbl_mod) %>% 
  add_recipe(wbl_rec) %>% 
  fit(wbl_train)

#The function of extracting predictions
preds_bind <- function(data_fit, lower = 0.05, upper = 0.95){
  predict(
    wbl_wflw$fit$fit$fit, 
    workflows::extract_recipe(wbl_wflw) %>% bake(data_fit),
    type = "quantiles",
    quantiles = c(lower, upper, 0.50)
  ) %>% 
    with(predictions) %>% #extracts predictions of Ranger prediction object
    as_tibble() %>% 
    set_names(paste0(".pred", c("_lower", "_upper",  ""))) %>% 
    bind_cols(data_fit) %>% 
    select(contains(".pred"), wbl_index)
}


#Accuracy of train and test set
wbl_preds_train <- preds_bind(wbl_train)
wbl_preds_test <- preds_bind(wbl_test)

bind_rows(
  yardstick::rsq(wbl_preds_train, wbl_index , .pred),
  yardstick::rsq(wbl_preds_test, wbl_index, .pred)
) %>% 
  mutate(dataset = c("training", "test"))

# A tibble: 2 × 4
  .metric .estimator .estimate dataset 
  <chr>   <chr>          <dbl> <chr>   
1 rsq     standard       0.728 training
2 rsq     standard       0.718 test

The accuracy results look fine for both the train and test sets. Hence, we can make a variable importance calculations with this model.

#Variable importance
library(DALEXtra)

#Creating a preprocessed dataframe of the train dataset
imp_wbl <- 
  wbl_rec %>%
  prep() %>%
  bake(new_data = NULL)

#Explainer
explainer_wbl <- 
  explain_tidymodels(
    wbl_wflw, 
    data = imp_wbl %>% select(-wbl_index), 
    y = imp_wbl$wbl_index,
    label = "",
    verbose = FALSE
  )

#Computing the permutation-based variable-importance measure
set.seed(12345)
vip_wbl <- model_parts(explainer_wbl, 
                       B = 100,
                       loss_function = loss_root_mean_square)


#Plotting variable importance 

#Averaged RMSE value for the full model
wbl_dropout <- 
  vip_wbl %>% 
  filter(variable == "_full_model_") %>% 
  summarise(dropout_loss = mean(dropout_loss))

vip_wbl %>%
  filter(variable != "_full_model_",
         variable != "_baseline_") %>% 
  mutate(label = str_replace_all(variable, "_", " ") %>% str_to_title(),
         label = fct_reorder(label, dropout_loss)) %>%
  ggplot(aes(dropout_loss, label)) +
  geom_vline(data = wbl_dropout, 
             aes(xintercept = dropout_loss),
             linewidth = 1.4, 
             lty = 2, 
             alpha = 0.7,
             color = "red") +
  geom_boxplot(fill = "#91CBD765", 
               alpha = 0.4) +
  labs(x = "",
       y = "",
       title = "Root mean square error (RMSE) after 100 permutations",
       subtitle = "The <span style='color:red'>dashed line</span> shows the RMSE for the full model",
       caption = "Higher indicates more important") +
  theme_minimal(base_family = "Bricolage Grotesque") +
  theme(plot.title = element_text(face = "bold"),
        plot.subtitle = ggtext::element_markdown(),
        plot.caption = element_text(hjust = 0.5),
        axis.text.y = element_text(size = 11),
        plot.background = element_rect(fill = "#F0F0F0"))

According to the graph, the effects of paid paternity and maternity leave seem to be very close, which is pretty interesting. The most prominent effect belongs to the region variable that shows the geo-cultural effects are important to determine the WBL index score.

To leave a comment for the author, please follow the link and comment on their blog: DataGeeek.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version