Data Visualization of the WBL Index and Modeling with Quantile Regression using Random Forest
As we enter the new year, women still do not appear to have equal rights with men in the working environment. This gap is more pronounced in developing and least developed countries. This article examines the situation using the WBL (Women, Business, and the Law) index.
First, we will create a plot comparing WBL scores by region, excluding high-income economies.
library(tidyverse)
library(openxlsx)
library(ragg) #google font setting

df_wbl <- 
  openxlsx::read.xlsx("https://github.com/mesdi/blog/raw/main/wbl.xlsx",
                      sheet = "WBL Panel 2023") %>% 
  as_tibble() %>% 
  janitor::clean_names()

#Plotting compared WBL index scores by region
#Excluded high-income OECD countries
df_wbl %>% 
  #remove the high income level countries
  filter(region != "High income: OECD") %>% 
  group_by(region) %>% 
  mutate(wbl_mean_index = mean(wbl_index)) %>% 
  select(wbl_mean_index) %>% 
  unique() %>% 
  ggplot(aes(wbl_mean_index,
             reorder(region, wbl_mean_index),
             fill = region)) +
  geom_col(width = 0.5) +
  geom_text(aes(label = .data$wbl_mean_index %>% round(2)),
            nudge_x = 2,
            family = "Bricolage Grotesque") +
  scale_x_continuous(position = "top") +
  labs(x = "", y = "",
       title = "Comparing WBL Index Averages by Region, 1971-2023",
       subtitle = "Excluding high-income economies",
       #for two captions
       caption = c("Source: Women, Business, and the Law database",
                   "WBL: Women, Business, and the Law")) +
  theme_minimal(base_family = "Bricolage Grotesque",
                base_size = 16) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(linewidth = 1.1),
        plot.title = element_text(hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 11),
        #for two captions layout
        plot.caption = element_text(hjust = c(0.06, 1), size = 10),
        legend.position = "none",
        plot.background = element_rect(fill = "#F0F0F0"))
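For reference, we can compute the same average for the excluded high-income OECD group. This is a minimal sketch of my own, assuming only the df_wbl tibble loaded above:

#Average WBL index of the excluded high-income OECD group, for reference
#(assumes df_wbl from the chunk above)
df_wbl %>% 
  filter(region == "High income: OECD") %>% 
  summarise(wbl_mean_index = mean(wbl_index))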
It seems that Sub-Saharan Africa is making progress and has the potential to close the gap with the East Asia and Pacific region. Latin America, on the other hand, appears not to be behind Europe at all.
Now, we will examine the reasons behind the above results. To do this, we will use quantile regression via a random forest model.
#Modeling
library(tidymodels)
library(gt)

#Splitting into train and test sets
set.seed(1983)
data_split <- initial_split(df_wbl, strata = "wbl_index", prop = 0.8)
wbl_train <- training(data_split)
wbl_test <- testing(data_split)

#Recipe
wbl_rec <- 
  recipe(wbl_index ~ region + income_group + 
           length_of_paid_paternity_leave + 
           length_of_paid_maternity_leave,
         data = wbl_train)

#Quantile regression via random forest from ranger package
wbl_mod <- 
  rand_forest() %>% 
  set_engine("ranger",
             importance = "permutation", #for variable importance
             seed = 12345,
             quantreg = TRUE #for quantile regression
  ) %>% 
  set_mode("regression")

set.seed(98765)
wbl_wflw <- 
  workflow() %>% 
  add_model(wbl_mod) %>% 
  add_recipe(wbl_rec) %>% 
  fit(wbl_train)

#The function of extracting predictions
preds_bind <- function(data_fit, lower = 0.05, upper = 0.95){
  predict(
    wbl_wflw$fit$fit$fit,
    workflows::extract_recipe(wbl_wflw) %>% bake(data_fit),
    type = "quantiles",
    quantiles = c(lower, upper, 0.50)
  ) %>% 
    with(predictions) %>% #extracts predictions of the ranger prediction object
    as_tibble() %>% 
    set_names(paste0(".pred", c("_lower", "_upper", ""))) %>% 
    bind_cols(data_fit) %>% 
    select(contains(".pred"), wbl_index)
}

#Accuracy of train and test set
wbl_preds_train <- preds_bind(wbl_train)
wbl_preds_test <- preds_bind(wbl_test)

bind_rows(
  yardstick::rsq(wbl_preds_train, wbl_index, .pred),
  yardstick::rsq(wbl_preds_test, wbl_index, .pred)
) %>% 
  mutate(dataset = c("training", "test"))

# A tibble: 2 × 4
  .metric .estimator .estimate dataset 
  <chr>   <chr>          <dbl> <chr>   
1 rsq     standard       0.728 training
2 rsq     standard       0.718 test
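Since quantreg = TRUE also gives us lower and upper quantile predictions, we can additionally check how often the observed wbl_index falls inside the 5th-95th percentile interval on the test set. This is a quick sketch of my own, assuming only the wbl_preds_test tibble produced by preds_bind() above:

#Empirical coverage of the 90% quantile interval on the test set
#(assumes wbl_preds_test from the chunk above)
wbl_preds_test %>% 
  mutate(inside = wbl_index >= .pred_lower & wbl_index <= .pred_upper) %>% 
  summarise(coverage = mean(inside))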
The accuracy results look fine for both the training and test sets. Hence, we can compute variable importance with this model.
#Variable importance
library(DALEXtra)

#Creating a preprocessed dataframe of the train dataset
imp_wbl <- 
  wbl_rec %>% 
  prep() %>% 
  bake(new_data = NULL)

#Explainer
explainer_wbl <- 
  explain_tidymodels(
    wbl_wflw,
    data = imp_wbl %>% select(-wbl_index),
    y = imp_wbl$wbl_index,
    label = "",
    verbose = FALSE
  )

#Computing the permutation-based variable-importance measure
set.seed(12345)
vip_wbl <- 
  model_parts(explainer_wbl,
              B = 100,
              loss_function = loss_root_mean_square)

#Plotting variable importance
#Averaged RMSE value for the full model
wbl_dropout <- 
  vip_wbl %>% 
  filter(variable == "_full_model_") %>% 
  summarise(dropout_loss = mean(dropout_loss))

vip_wbl %>% 
  filter(variable != "_full_model_",
         variable != "_baseline_") %>% 
  mutate(label = str_replace_all(variable, "_", " ") %>% str_to_title(),
         label = fct_reorder(label, dropout_loss)) %>% 
  ggplot(aes(dropout_loss, label)) +
  geom_vline(data = wbl_dropout,
             aes(xintercept = dropout_loss),
             linewidth = 1.4,
             lty = 2,
             alpha = 0.7,
             color = "red") +
  geom_boxplot(fill = "#91CBD765", alpha = 0.4) +
  labs(x = "", y = "",
       title = "Root mean square error (RMSE) after 100 permutations",
       subtitle = "The <span style='color:red'>dashed line</span> shows the RMSE for the full model",
       caption = "Higher indicates more important") +
  theme_minimal(base_family = "Bricolage Grotesque") +
  theme(plot.title = element_text(face = "bold"),
        plot.subtitle = ggtext::element_markdown(),
        plot.caption = element_text(hjust = 0.5),
        axis.text.y = element_text(size = 11),
        plot.background = element_rect(fill = "#F0F0F0"))
According to the graph, the effects of paid paternity and maternity leave seem to be very close, which is quite interesting. The most prominent effect belongs to the region variable, which suggests that geo-cultural factors are important in determining the WBL index score.
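To look at the direction of an effect rather than only its magnitude, we could also draw a partial-dependence profile with the same explainer. The sketch below is an optional extra that uses DALEX's model_profile() on the paid maternity leave variable; it assumes the explainer_wbl object created above.

#Partial-dependence profile for paid maternity leave
#(a sketch; assumes explainer_wbl from the chunk above)
set.seed(12345)
pdp_wbl <- model_profile(explainer_wbl,
                         variables = "length_of_paid_maternity_leave",
                         N = 500)

plot(pdp_wbl) +
  labs(title = "Partial-dependence profile",
       subtitle = "Average prediction across lengths of paid maternity leave")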