Congratulations to Joshua Angrist and Guido Imbens on the Nobel Prize for their work on causality. This post will also be about causality, although not about the work of Angrist or Imbens. A year ago I read The Book of Why by Judea Pearl and Dana Mackenzie. The Book of Why states that you can draw causal conclusions from observational data if you know the directed acyclic graph (DAG) of the process that created the data. It also states that you cannot create the DAG from data alone, i.e. you cannot feed the data to an algorithm and get the DAG and its causal implications as output. The book further explains that Judea Pearl studied Bayesian networks before he began his work on DAGs and causality. I hypothesise that some DAGs are more probable than other DAGs given the statistics of the data, and I am examining whether there are ways to get bounds on the causal effects that variables can have on each other within a limited system. I will use structure learning algorithms for Bayesian networks. I will not take unmeasured confounders into account. This work is ongoing and the results are presented as is. I will use data from Statistics Sweden; for more information about the data, see my previous posts.
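As a minimal sketch of why the DAG cannot be read off the data (using the dagitty package, which is loaded further down): with two Gaussian variables, the models x -> y and y -> x are Markov equivalent, so they fit observational data equally well, and only the statistics of a larger system can make some DAGs more probable than others.

# a minimal sketch: both DAGs in the Markov equivalence class of x -> y
# imply the same observational distribution
g <- dagitty("dag { x -> y }")
equivalentDAGs(g) # returns both x -> y and y -> x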
Statistics Sweden uses NUTS (Nomenclature des Unités Territoriales Statistiques), the EU's hierarchical regional division, to specify the regions.
First, define libraries and functions.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.0.5
## Warning: package 'tibble' was built under R version 4.0.5
## Warning: package 'tidyr' was built under R version 4.0.5
## Warning: package 'readr' was built under R version 4.0.5
## Warning: package 'dplyr' was built under R version 4.0.5
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(gtools)
## Warning: package 'gtools' was built under R version 4.0.5

library(pcalg)
## Warning: package 'pcalg' was built under R version 4.0.5

library(imputeMissings)
## Warning: package 'imputeMissings' was built under R version 4.0.5
##
## Attaching package: 'imputeMissings'
## The following object is masked from 'package:dplyr':
##
##     compute

library(bnlearn)
## Warning: package 'bnlearn' was built under R version 4.0.5
##
## Attaching package: 'bnlearn'
## The following object is masked from 'package:imputeMissings':
##
##     impute
## The following objects are masked from 'package:pcalg':
##
##     dsep, pdag2dag, shd, skeleton

library(dagitty)
## Warning: package 'dagitty' was built under R version 4.0.4
##
## Attaching package: 'dagitty'
## The following objects are masked from 'package:bnlearn':
##
##     ancestors, children, descendants, parents, spouses
## The following object is masked from 'package:pcalg':
##
##     randomDAG

library(AER)
## Warning: package 'AER' was built under R version 4.0.5
## Loading required package: car
## Warning: package 'car' was built under R version 4.0.5
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.3
##
## Attaching package: 'car'
## The following object is masked from 'package:gtools':
##
##     logit
## The following object is masked from 'package:dplyr':
##
##     recode
## The following object is masked from 'package:purrr':
##
##     some
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 4.0.4
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## Warning: package 'sandwich' was built under R version 4.0.5
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.0.5

library(lavaan)
## Warning: package 'lavaan' was built under R version 4.0.5
## This is lavaan 0.6-9
## lavaan is FREE software! Please report any bugs.
library(semPlot)
## Warning: package 'semPlot' was built under R version 4.0.3

library(psych)
## Warning: package 'psych' was built under R version 4.0.5
##
## Attaching package: 'psych'
## The following object is masked from 'package:lavaan':
##
##     cor2cov
## The following object is masked from 'package:car':
##
##     logit
## The following object is masked from 'package:gtools':
##
##     logit
## The following objects are masked from 'package:ggplot2':
##
##     %+%, alpha

readfile <- function(file1) {
  read_csv(file1, col_types = cols(),
    locale = readr::locale(encoding = "latin1"),
    na = c("..", "NA")) %>%
    gather(starts_with("19"), starts_with("20"), key = "year", value = groupsize) %>%
    drop_na() %>%
    mutate(year_n = parse_number(year))
}

perc_women <- function(x) {
  ifelse(length(x) == 2, x[2] / (x[1] + x[2]), NA)
}

nuts <- read.csv("nuts.csv") %>%
  mutate(NUTS2_sh = substr(NUTS2, 3, 4))

nuts %>%
  distinct(NUTS2_en) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Nomenclature des Unités Territoriales Statistiques (NUTS)')
NUTS2_en |
---|
SE11 Stockholm |
SE12 East-Central Sweden |
SE21 Småland and islands |
SE22 South Sweden |
SE23 West Sweden |
SE31 North-Central Sweden |
SE32 Central Norrland |
SE33 Upper Norrland |
# normalize data
tonormest <- function(tablearg) {
  norm_recipe <- recipes::recipe(~ ., data = tablearg) %>%
    recipes::step_normalize(recipes::all_numeric())
  prepare <- recipes::prep(norm_recipe, training = tablearg)
  return(recipes::bake(prepare, new_data = tablearg))
}

# not in set
`%nin%` = Negate(`%in%`)

# create a set of arcs not allowed in the model
createblacklist <- function(coln, exogenous, endogenous = NULL) {
  rbind(
    expand.grid(endogenous, coln[coln %nin% endogenous]),
    expand.grid(coln[coln %nin% exogenous], exogenous))
}

# list of models in bnlearn to evaluate
bnmodels <- list(
  h2pc = function(x, blacklist = blacklist) bnlearn::h2pc(x, blacklist = blacklist),
  hpc = function(x, blacklist = blacklist) bnlearn::hpc(x, blacklist = blacklist),
  fast.iamb = function(x, blacklist = blacklist) bnlearn::fast.iamb(x, blacklist = blacklist),
  inter.iamb = function(x, blacklist = blacklist) bnlearn::inter.iamb(x, blacklist = blacklist),
  si.hiton.pc = function(x, blacklist = blacklist) bnlearn::si.hiton.pc(x, blacklist = blacklist),
  iamb.fdr = function(x, blacklist = blacklist) bnlearn::iamb.fdr(x, blacklist = blacklist),
  iamb = function(x, blacklist = blacklist) bnlearn::iamb(x, blacklist = blacklist),
  gs = function(x, blacklist = blacklist) bnlearn::gs(x, blacklist = blacklist),
  mmpc = function(x, blacklist = blacklist) bnlearn::mmpc(x, blacklist = blacklist),
  pc = function(x, blacklist = blacklist) bnlearn::pc.stable(x, blacklist = blacklist),
  hc = function(x, blacklist = blacklist) bnlearn::hc(x, blacklist = blacklist),
  tabu = function(x, blacklist = blacklist) bnlearn::tabu(x, blacklist = blacklist),
  mmhc = function(x, blacklist = blacklist) bnlearn::mmhc(x, blacklist = blacklist),
  rsmax2 = function(x, blacklist = blacklist) bnlearn::rsmax2(x, blacklist = blacklist))

# evaluate all models on a dataset given a blacklist
evaluatemodel <- function(tablearg, blacklist) {
  evalslfunc <- function(f) {
    sltree <- f(data.frame(map(tablearg, inttonumeric)), blacklist = blacklist)
    neltree <- as.graphNEL(sltree)
    bg <- addBgKnowledge(neltree)
    if (length(bg) == 0) {
      return(Inf)
    }
    g <- dagitty(pcalg2dagitty(t(as(bg, "matrix")), colnames(tablearg), type = "dag"))
    r <- localTests(g, tablearg, "cis",
      sample.cov = lavCor(tablearg),
      sample.nobs = nrow(tablearg),
      max.conditioning.variables = 2, R = 100)
    if (dim(r)[1] == 0) {
      return(0)
    }
    return(mean(abs(r$estimate)))
  }
  return(data.frame(lapply(bnmodels, evalslfunc)))
}

# change all variables of integer type to a numeric type
inttonumeric <- function(x) {
  if (is.integer(x)) {
    return(as.numeric(x))
  } else {
    return(x)
  }
}

# create a structured model from a dataset and a structure learning algorithm
createmodel <- function(tablearg, f, blacklist) {
  sltree <- f(data.frame(map(tablearg, inttonumeric)), blacklist = blacklist)
  neltree <- as.graphNEL(sltree)
  bg <- addBgKnowledge(neltree)
  g <- dagitty(pcalg2dagitty(t(as(bg, "matrix")), colnames(tablearg), type = "dag"))
  return(g)
}

# change the sign from dagitty syntax to lavaan syntax
changesign <- function(s) {
  if (s == "->") {
    return("~")
  }
  if (s == "--") {
    return("~~")
  }
}

# create a list of all combinations of variables, one variable from each factor
# from the factor analysis, and subsets thereof
allcombn <- function(gridarg) {
  allcombn_worker <- function(gridarg) {
    sumcomb <- vector()
    for (i in data.frame(t(gridarg))) {
      subcomb <- combn(i, length(gridarg) - 1)
      for (j in data.frame(subcomb)) {
        sumcomb <- rbind(sumcomb, j)
      }
    }
    return(unique(sumcomb))
  }
  sumcomb <- list()
  sumcomb <- append(sumcomb, split(gridarg, seq(nrow(gridarg))))
  subcomb <- allcombn_worker(gridarg)
  if (length(data.frame(subcomb)) > 0) {
    sumcomb <- append(sumcomb, allcombn(data.frame(subcomb)))
  }
  return(sumcomb)
}

# take a list of variable sets, use each list element as a blacklist, find the
# structure learning algorithm with the lowest deviance, and use that
# algorithm to create a model
# return:
#   the model with the lowest deviance
#   the deviances for the different models
#   the name of the algorithm with the lowest deviance
combtodag <- function(mycomb, tablearg) {
  summary_table <- vector()
  for (i in 1:length(mycomb)) {
    j <- unlist(mycomb[i])
    blacklist <- createblacklist(colnames(tablearg), j)
    evaluatedmodel <- evaluatemodel(tablearg, blacklist)
    g <- createmodel(tablearg, match.fun(colnames(sort(evaluatedmodel))[1]), blacklist)
    v <- as.character(j)
    length(v) <- 5
    summary_table <- rbind(summary_table,
      c(as.character(g), t(evaluatedmodel), v, colnames(sort(evaluatedmodel))[1]))
  }
  return(summary_table)
}

# calculate the causal effect of the variable `from` on the variable `to`,
# using all models in listofmodels and all minimal adjustment sets that are
# given by each model
calceffect <- function(listofmodels, from, to, tablearg) {
  summary_table <- vector()
  expadjsets <- function(adjsets) {
    paste0(to, " ~ ", from, " + ", paste(unlist(adjsets), collapse = " + "))
  }
  calclmeffect <- function(eq) {
    (summary(lm(as.formula(eq), tablearg)) %>% broom::tidy())[2, ]
  }
  for (i in 1:nrow(data.frame(listofmodels))) {
    g <- as.dagitty(listofmodels[i, 1])
    exogenous <- listofmodels[i, 16:20]
    adjsets <- adjustmentSets(g, from, to)
    if (is.null(unlist(adjsets)) & (length(adjsets) == 1)) {
      eq <- paste(to, " ~ ", from)
    } else {
      if (length(adjsets) > 0) {
        eq <- map(adjsets, expadjsets)
      } else {
        eq <- NULL
      }
    }
    if (!is.null(eq)) {
      mytest <- map(unlist(eq), calclmeffect)
    } else {
      mytest <- NULL
    }
    v <- as.character(exogenous)
    length(v) <- 5
    v1 <- t(data.frame(v))
    colnames(v1) <- c("1", "2", "3", "4", "5")
    if (!is.null(mytest)) {
      summary_table <- rbind(summary_table,
        cbind(Reduce('rbind', mytest), t(data.frame(eq)), v1))
    }
  }
  return(summary_table)
}

# create an SEM model for each dagitty model and return a table containing the
# estimated parameters of the fitted model and a variety of fit measures for
# each model
dagitty2sem <- function(mycomb, tablearg) {
  summary_table <- vector()
  summary_table2 <- vector()
  for (i in 1:nrow(data.frame(mycomb))) {
    g <- as.dagitty(mycomb[i, 1])
    exogenous <- mycomb[i, 16:20]
    fit <- suppressWarnings(sem(paste(dagityy2lavaan(g, exogenous), collapse = ''), data = tablearg))
    fit_m <- fitmeasures(fit)
    v <- as.character(exogenous)
    length(v) <- 5
    sumfit <- parameterEstimates(fit)
    summary_table2 <- rbind(summary_table2,
      cbind(sumfit, matrix(v, ncol = 5, nrow = nrow(sumfit), byrow = TRUE)))
    summary_table <- rbind(summary_table, c(t(fit_m), v))
  }
  summary_table <- cbind(summary_table[, 43:47],
    data.frame(map(data.frame(summary_table[, 1:42]), as.numeric)))
  colnames(summary_table) <- c(c("1", "2", "3", "4", "5"), names(fit_m))
  summary_table2 <- summary_table %>%
    left_join(summary_table2, by = c("1", "2", "3", "4", "5"))
  return(summary_table2)
}

# convert a model with dagitty syntax to lavaan syntax. Relations between
# exogenous variables will be replaced by correlations
dagityy2lavaan <- function(model_arg, exogenous) {
  temp <- str_split(model_arg, "\n") %>% unlist()
  temp <- lapply("->|--", grep, x = temp, value = TRUE) %>%
    unlist() %>%
    data.frame() %>%
    as_tibble()
  colnames(temp) <- "data"
  temp <- temp %>%
    rowwise() %>%
    mutate(lhs = unlist(str_split(data, " "))[1]) %>%
    mutate(rhs = unlist(str_split(data, " "))[3]) %>%
    mutate(sign = unlist(str_split(data, " "))[2])
  # one variable can be exogenous but not both
  endorel <- temp %>% dplyr::filter(!(lhs %in% exogenous & rhs %in% exogenous))
  exorel <- temp %>% dplyr::filter(lhs %in% exogenous & rhs %in% exogenous)
  exorel$sign <- "--"
  temp <- data.frame(rbind(endorel, exorel)) %>%
    rowwise() %>%
    mutate(mydata3 = paste(rhs, changesign(sign), lhs, "\n"))
  return(temp$mydata3)
}

filtergt0_3 <- function(loadings) {
  loadings <- data.frame(loadings)
  rownames(loadings) <- rownames(factorloadings)
  rownames(loadings)[which(abs(loadings) > 0.3)]
}
The data tables are downloaded from Statistics Sweden. They are saved as comma-delimited files without headings from http://www.statistikdatabasen.scb.se/pxweb/en/ssd/.
The tables:
UF0506A1_20210926-160849.csv: Population 16–74 years of age by region, highest level of education, age and sex. Year 1985–2020. NUTS 2 level 2008–. 10-year intervals (16–74).
000000CG_20210926-160057.csv: Average basic salary, monthly salary and women's salary as a percentage of men's salary by region, sector, occupational group (SSYK 2012) and sex. Year 2014–2020. Monthly salary, all sectors.
000000CD_20210926-160259.csv: Average basic salary, monthly salary and women's salary as a percentage of men's salary by region, sector, occupational group (SSYK 2012) and sex. Year 2014–2020. Number of employees, all sectors.
The data is aggregated; the size of each group is given in the column groupsize.
I have also included some calculated predictors from the original data; a small sketch of the share calculation used by several of them follows after the list.
nremployees: The number of employees in each group defined by ssyk, edulevel, region and year
perc_women: The percentage of women within each group defined by ssyk, edulevel, region and year
perc_women_region: The percentage of women in the population within each group defined by year and region
regioneduyears: The average number of education years per capita within each group defined by ssyk, year and region
eduquotient: The quotient between regioneduyears for men and women
salaryquotient: The quotient between salary for men and women within each group defined by ssyk, year and region
perc_women_ssyk_region: The percentage of women within each group defined by ssyk, year and region
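As a small illustration of that share calculation, here is the perc_women helper from above applied to a hypothetical group-size vector, assuming the first element is the number of men and the second the number of women, as in the sorted data:

# hypothetical group sizes, ordered c(men, women) as in the sorted data
perc_women(c(60, 40))
## [1] 0.4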
numedulevel <- read.csv("edulevel_1.csv")

numedulevel[, 2] <- data.frame(c(8, 9, 10, 12, 13, 15, 22, NA))

tb <- readfile("000000CG_20210926-160057.csv")

tb <- readfile("000000CD_20210926-160259.csv") %>%
  left_join(tb, by = c("region", "year", "sex", "sector", "occuptional (SSYK 2012)"))

tb <- readfile("UF0506A1_20210926-160849.csv") %>%
  right_join(tb, by = c("region", "year", "sex")) %>%
  right_join(numedulevel, by = c("level of education" = "level.of.education")) %>%
  filter(!is.na(eduyears)) %>%
  mutate(edulevel = `level of education`) %>%
  group_by(edulevel, region, year, sex, `occuptional (SSYK 2012)`) %>%
  mutate(groupsize_all_ages = sum(groupsize)) %>%
  group_by(edulevel, region, year, `occuptional (SSYK 2012)`) %>%
  mutate(perc_women = perc_women(groupsize_all_ages[1:2])) %>%
  mutate(nremployees = sum(groupsize.x)) %>%
  mutate(salary = (groupsize.y[2] * groupsize.x[2] + groupsize.y[1] * groupsize.x[1]) / (groupsize.x[2] + groupsize.x[1])) %>%
  group_by(sex, year, region, `occuptional (SSYK 2012)`) %>%
  mutate(regioneduyears_sex = sum(groupsize * eduyears) / sum(groupsize)) %>%
  mutate(regiongroupsize = sum(groupsize)) %>%
  mutate(nremployees_sex = groupsize.x) %>%
  group_by(region, year, `occuptional (SSYK 2012)`) %>%
  mutate(sum_pop = sum(groupsize)) %>%
  mutate(regioneduyears = sum(groupsize * eduyears) / sum(groupsize)) %>%
  mutate(perc_women_region = perc_women(regiongroupsize[1:2])) %>%
  mutate(eduquotient = regioneduyears_sex[2] / regioneduyears_sex[1]) %>%
  mutate(salary_sex = groupsize.y) %>%
  mutate(salaryquotient = salary_sex[2] / salary_sex[1]) %>%
  mutate(perc_women_ssyk_region = perc_women(nremployees_sex[1:2])) %>%
  left_join(nuts %>% distinct(NUTS2_en, NUTS2_sh), by = c("region" = "NUTS2_en")) %>%
  drop_na()

summary(tb)
##     region              age            level of education     sex
##  Length:316974      Length:316974      Length:316974      Length:316974
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##      year             groupsize         year_n        sector
##  Length:316974      Min.   :    0   Min.   :2014   Length:316974
##  Class :character   1st Qu.: 2561   1st Qu.:2015   Class :character
##  Mode  :character   Median : 7456   Median :2017   Mode  :character
##                     Mean   :11676   Mean   :2017
##                     3rd Qu.:16443   3rd Qu.:2019
##                     Max.   :81358   Max.   :2020
##  occuptional (SSYK 2012)  groupsize.x       year_n.x     groupsize.y
##  Length:316974           Min.   :  100   Min.   :2014   Min.   : 20200
##  Class :character        1st Qu.:  400   1st Qu.:2015   1st Qu.: 29100
##  Mode  :character        Median : 1100   Median :2017   Median : 34300
##                          Mean   : 2893   Mean   :2017   Mean   : 37470
##                          3rd Qu.: 3000   3rd Qu.:2019   3rd Qu.: 42600
##                          Max.   :53700   Max.   :2020   Max.   :139500
##     year_n.y       eduyears       edulevel         groupsize_all_ages
##  Min.   :2014   Min.   : 8.00   Length:316974      Min.   :   405
##  1st Qu.:2015   1st Qu.: 9.00   Class :character   1st Qu.: 24027
##  Median :2017   Median :12.00   Mode  :character   Median : 56038
##  Mean   :2017   Mean   :12.71                      Mean   : 70055
##  3rd Qu.:2019   3rd Qu.:15.00                      3rd Qu.:111943
##  Max.   :2020   Max.   :22.00                      Max.   :288426
##    perc_women      nremployees         salary       regioneduyears_sex
##  Min.   :0.3575   Min.   :   600   Min.   : 20200   Min.   :11.18
##  1st Qu.:0.4439   1st Qu.:  4800   1st Qu.: 29214   1st Qu.:11.65
##  Median :0.4816   Median : 13080   Median : 34488   Median :11.83
##  Mean   :0.4831   Mean   : 32797   Mean   : 37494   Mean   :11.86
##  3rd Qu.:0.5000   3rd Qu.: 37800   3rd Qu.: 42738   3rd Qu.:12.13
##  Max.   :0.6484   Max.   :426600   Max.   :123796   Max.   :12.64
##  regiongroupsize  nremployees_sex    sum_pop         regioneduyears
##  Min.   :127118   Min.   :  100   Min.   : 127118   Min.   :11.18
##  1st Qu.:291940   1st Qu.:  400   1st Qu.: 518853   1st Qu.:11.63
##  Median :528643   Median : 1100   Median : 722010   Median :11.85
##  Mean   :490383   Mean   : 2893   Mean   : 878571   Mean   :11.86
##  3rd Qu.:708813   3rd Qu.: 3000   3rd Qu.:1395157   3rd Qu.:12.01
##  Max.   :842459   Max.   :53700   Max.   :1682100   Max.   :12.64
##  perc_women_region  eduquotient      salary_sex     salaryquotient
##  Min.   :0.4831    Min.   :1.000   Min.   : 20200   Min.   :0.6423
##  1st Qu.:0.4893    1st Qu.:1.020   1st Qu.: 29100   1st Qu.:0.9333
##  Median :0.4949    Median :1.030   Median : 34300   Median :0.9804
##  Mean   :0.4945    Mean   :1.026   Mean   : 37470   Mean   :0.9637
##  3rd Qu.:0.5000    3rd Qu.:1.039   3rd Qu.: 42600   3rd Qu.:1.0000
##  Max.   :0.5014    Max.   :1.049   Max.   :139500   Max.   :1.3090
##  perc_women_ssyk_region   NUTS2_sh
##  Min.   :0.009346       Length:316974
##  1st Qu.:0.384615       Class :character
##  Median :0.500000       Mode  :character
##  Mean   :0.518956
##  3rd Qu.:0.674419
##  Max.   :0.945274

tbtemp <- ungroup(tb) %>%
  dplyr::select(salary, nremployees, year_n, sum_pop, regioneduyears,
    perc_women_region, salaryquotient, eduquotient,
    perc_women_ssyk_region, `occuptional (SSYK 2012)`)

tb_unique <- unique(tbtemp)
The data is normalised before the analysis so that the scale of the variables does not affect the analysis. Missing values are then imputed by replacing NA with the median.
tb_unique_norm <- tb_unique
tb_unique_norm <- data.frame(data.matrix(tonormest(tb_unique)))
tb_unique_norm <- imputeMissings::impute(tb_unique_norm, object = NULL, method = "median/mode", flag = FALSE)
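As a small sketch of what these two steps do, using a hypothetical toy column v with a missing value (tonormest is the helper defined above):

# hypothetical toy column with a missing value
toy <- data.frame(v = c(1, 2, 3, NA))

# scale to mean 0 and standard deviation 1
toy_norm <- tonormest(toy)

# replace the remaining NA with the column median
imputeMissings::impute(toy_norm, method = "median/mode")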
I will use the package bnlearn to approximate a DAG from the data. bnlearn contains several algorithms for this purpose. One way to use prior knowledge together with the structure learning algorithms in the bnlearn package is to specify a blacklist or a whitelist. Arcs in the whitelist are always included in the network; arcs in the blacklist are never included in the network. If we don't know a priori which arcs to put in the blacklist or whitelist, we can evaluate several models with different settings. To limit the number of models to evaluate, I make some assumptions. I assume that the variation in the data can be expressed using fewer variables, e.g. by principal component analysis; for this application I use factor analysis. I also assume that the number of exogenous variables in the model is less than or equal to the number of factors suggested by the factor analysis. I use the psych package's fa.parallel function to determine the number of factors. The warnings from the factor analysis are ignored for now; a better choice of rotation and factoring method could perhaps resolve them, a future improvement. The variables with a loading greater than 0.3 on a factor are chosen for further processing.
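To make the blacklist mechanism concrete, here is a minimal sketch with hypothetical variables a, b and c, using the createblacklist helper defined above; blacklisting every arc that points into a forces a to be exogenous in the learned network.

# hypothetical toy data where a drives b, and b drives c
set.seed(1)
toy <- data.frame(a = rnorm(200))
toy$b <- toy$a + rnorm(200)
toy$c <- toy$b + rnorm(200)

# blacklist all arcs into "a", i.e. "a" must be exogenous
bl <- createblacklist(colnames(toy), exogenous = "a")

# hill-climbing, with the blacklist ruling out b -> a and c -> a
hc(toy, blacklist = bl)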
fatest <- fa.parallel(tb_unique_norm, fm = "minres", fa = "fa")
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully
## Parallel analysis suggests that the number of factors = 5 and the number of components = NA

factoranalysis <- fa(tb_unique_norm, nfactors = fatest$nfact, rotate = "oblimin", fm = "minres")
## Loading required namespace: GPArotation
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected. Examine the results carefully

print(factoranalysis$loadings, cutoff = 0.3)
##
## Loadings:
##                           MR1    MR2    MR3    MR4    MR5
## salary                                 -0.979
## nremployees                      0.466
## year_n                                         0.998
## sum_pop                          0.996
## regioneduyears                   0.570
## perc_women_region         0.906
## salaryquotient
## eduquotient              -0.969
## perc_women_ssyk_region                                0.998
## occuptional...SSYK.2012.                0.779
##
##                  MR1   MR2   MR3   MR4   MR5
## SS loadings    1.949 1.702 1.676 1.077 1.059
## Proportion Var 0.195 0.170 0.168 0.108 0.106
## Cumulative Var 0.195 0.365 0.533 0.640 0.746

factorloadings <- print(factoranalysis$loadings, cutoff = 0.3)
##
## Loadings:
##                           MR1    MR2    MR3    MR4    MR5
## salary                                 -0.979
## nremployees                      0.466
## year_n                                         0.998
## sum_pop                          0.996
## regioneduyears                   0.570
## perc_women_region         0.906
## salaryquotient
## eduquotient              -0.969
## perc_women_ssyk_region                                0.998
## occuptional...SSYK.2012.                0.779
##
##                  MR1   MR2   MR3   MR4   MR5
## SS loadings    1.949 1.702 1.676 1.077 1.059
## Proportion Var 0.195 0.170 0.168 0.108 0.106
## Cumulative Var 0.195 0.365 0.533 0.640 0.746

mygrid <- expand.grid(map(data.frame(factorloadings[, 1:5]), filtergt0_3))

mygrid %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Table of proposed exogenous variables to evaluate')
MR1 | MR2 | MR3 | MR4 | MR5 |
---|---|---|---|---|
perc_women_region | nremployees | salary | year_n | perc_women_ssyk_region |
eduquotient | nremployees | salary | year_n | perc_women_ssyk_region |
perc_women_region | sum_pop | salary | year_n | perc_women_ssyk_region |
eduquotient | sum_pop | salary | year_n | perc_women_ssyk_region |
perc_women_region | regioneduyears | salary | year_n | perc_women_ssyk_region |
eduquotient | regioneduyears | salary | year_n | perc_women_ssyk_region |
perc_women_region | nremployees | occuptional…SSYK.2012. | year_n | perc_women_ssyk_region |
eduquotient | nremployees | occuptional…SSYK.2012. | year_n | perc_women_ssyk_region |
perc_women_region | sum_pop | occuptional…SSYK.2012. | year_n | perc_women_ssyk_region |
eduquotient | sum_pop | occuptional…SSYK.2012. | year_n | perc_women_ssyk_region |
perc_women_region | regioneduyears | occuptional…SSYK.2012. | year_n | perc_women_ssyk_region |
eduquotient | regioneduyears | occuptional…SSYK.2012. | year_n | perc_women_ssyk_region |
I will start by creating all possible sets and subsets from the set of five exogenous variables that were selected by the factor analysis. For each set, I will create a model with each of the structure learning algorithms from the bnlearn package listed above. Each algorithm is evaluated on how well it minimises the deviations between the testable implications of the model and the dataset. The function localTests from the dagitty package is used to get a numeric value for the testable implications; I use the mean absolute deviation over all testable implications returned by localTests. This could favour more complex models, since complex models have fewer testable implications than simple ones. The localTests call in my test uses a combination of categorical and continuous data. A more advanced algorithm could have been used to select the model that minimises the deviation. The models so far are expressed in dagitty syntax.
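As a minimal sketch of this scoring idea on simulated data (with hypothetical variables x, y and z, not the Swedish dataset): express a candidate DAG in dagitty syntax and use the mean absolute deviation from localTests as the score.

# a minimal sketch on simulated data; x, y and z are hypothetical variables
set.seed(2)
d <- data.frame(x = rnorm(500))
d$y <- 0.5 * d$x + rnorm(500)
d$z <- 0.5 * d$y + rnorm(500)

# a candidate DAG written by hand; in the post the DAGs come from the
# bnlearn structure learning algorithms
g <- dagitty("dag { x -> y -> z }")

# localTests checks the conditional independencies implied by the DAG,
# here x independent of z given y; the mean absolute estimate is the score
r <- localTests(g, d, type = "cis")
mean(abs(r$estimate))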
From the dagitty syntax, I will create a structural equation model in lavaan for each model created in the earlier step. Each SEM model is evaluated with a variety of fit measures to assess the global fit of the latent variable model. No latent variables are evaluated at this stage, but you could perhaps imagine a correspondence between the factor analysis and some unmeasured latent variables. The model parameters for each model are also stored for later analysis.
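A minimal sketch of this step, reusing the simulated x, y and z data from the sketch above: the DAG x -> y -> z translates to the lavaan model below, and fitmeasures returns the global fit indices that are collected for every model.

# the DAG x -> y -> z written in lavaan syntax ("~" instead of "->",
# with the arrow reversed: child ~ parent)
toy_model <- "
  y ~ x
  z ~ y
"
fit <- sem(toy_model, data = d)

# a few of the global fit measures stored for each model
fitmeasures(fit, c("aic", "bic", "rmsea", "pvalue"))

# the estimated parameters, corresponding to the tables later in the post
parameterEstimates(fit)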
The table of models and the table of evaluated models are joined to allow extra comparisons.
Since the generation of the dagitty models takes a while, I have prepared that table in a file.
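The cached file could, for example, have been produced once with a pattern like the sketch below; this would also explain why the first column (the row names that write.csv stores) is dropped after read.csv in the next chunk.

# sketch of a caching pattern for the expensive combtodag step;
# list_of_exogenous_variables is created in the next chunk
if (!file.exists("table_of_dagitty_models.csv")) {
  table_of_dagitty_models <- combtodag(list_of_exogenous_variables, tb_unique_norm)
  write.csv(table_of_dagitty_models, "table_of_dagitty_models.csv")
}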
list_of_exogenous_variables <- allcombn(mygrid)

#table_of_dagitty_models <- combtodag(list_of_exogenous_variables, tb_unique_norm)
table_of_dagitty_models <- read.csv("table_of_dagitty_models.csv")
table_of_dagitty_models <- table_of_dagitty_models[, 2:22]

table_of_sem_models <- dagitty2sem(table_of_dagitty_models, tb_unique_norm)

table_of_dagitty_models_df <- data.frame(table_of_dagitty_models)
colnames(table_of_dagitty_models_df) <- c("model", names(bnmodels), c("1", "2", "3", "4", "5", "algorithm"))

dagitty_and_sem_table <- table_of_dagitty_models_df %>%
  left_join(table_of_sem_models, by = c("1", "2", "3", "4", "5"))

ggplot(dagitty_and_sem_table) +
  geom_point(aes(x = pvalue.x, y = rmsea))
dagitty_and_sem_table %>%
  mutate(deviance = as.numeric(pmin(h2pc, hpc, fast.iamb, inter.iamb, si.hiton.pc,
    iamb.fdr, iamb, gs, mmpc, pc, hc, tabu, mmhc, rsmax2))) %>%
  ggplot() +
  geom_point(aes(x = aic, y = deviance))
From the table, we can examine how many of the models contain a given arc, i.e. a relation from one variable to another. We find that the relation from the quotient between the average number of education years for men and women (eduquotient) to the quotient between salary for men and women (salaryquotient) appears in 121 out of 143 tested models. The estimated effect is about -0.32 standard units. Only the models with a p-value less than 0.05 are counted.
If we take the ten most frequent arcs in the models I have analysed (an arbitrary number that could be optimised) and use them as a whitelist, i.e. arcs that must be present in the model, when estimating a new model with the hill-climbing algorithm, we get a model that can be used to approximate the causal effects that can be estimated from the data. The plot shows the model. In this model, year and the percentage of women in the SSYK group are exogenous variables and the rest of the variables are endogenous. Looking at the model's parameter values sorted by effect size, we find that the effect of the percentage of women in the region on the quotient between the average number of education years for men and women is the largest of all effects between continuous variables; 79 out of 143 models contain this direction of the relation.
temp <- combn(colnames(tb_unique_norm), 2)
list_of_var_combinations <- cbind(temp, rbind(temp[2, ], temp[1, ]))

summary_table <- vector()
for (i in data.frame(list_of_var_combinations)) {
  est <- dagitty_and_sem_table %>%
    filter(lhs == i[1], rhs == i[2], pvalue.x < 0.05) %>%
    dplyr::select(est)
  summary_table <- rbind(summary_table, c(i, (t(summary(est))), nrow(est), sd(t(est))))
}

summary_table <- unique(cbind(summary_table[, 1:8],
  data.frame(map(data.frame(summary_table[, 9:10]), as.numeric)))) %>%
  arrange(-X1)

summary_table[1:10, ] %>%
  select(`1`, `2`, X1) %>%
  rename(lhs = `1`) %>%
  rename(rhs = `2`) %>%
  rename(nr_of_models = X1) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'The ten most common arcs of all 143 models')
lhs | rhs | nr_of_models |
---|---|---|
salaryquotient | eduquotient | 121 |
salaryquotient | perc_women_ssyk_region | 111 |
salaryquotient | salary | 110 |
regioneduyears | year_n | 110 |
nremployees | perc_women_ssyk_region | 104 |
salaryquotient | year_n | 104 |
salary | year_n | 98 |
occuptional…SSYK.2012. | perc_women_ssyk_region | 98 |
salaryquotient | sum_pop | 97 |
regioneduyears | sum_pop | 95 |
whitelist <- summary_table[1:10, 2:1]

hctree <- hc(tb_unique_norm, whitelist = whitelist)
neltree <- as.graphNEL(hctree)
bg <- addBgKnowledge(neltree)
if (length(bg) == 0) {
  return(Inf)
}
g <- dagitty(pcalg2dagitty(t(as(bg, "matrix")), colnames(tb_unique_norm), type = "dag"))

fit <- sem(paste(dagityy2lavaan(g, NULL), collapse = ''), data = tb_unique_norm)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan
## WARNING: some observed variances are (at least) a factor 1000 times larger than
## others; use varTable(fit) to investigate

semPaths(fit, 'std', 'est', curveAdjacent = TRUE, style = "lisrel")
r <- localTests(g, tb_unique_norm, "cis",
  sample.cov = lavCor(tb_unique_norm),
  sample.nobs = nrow(tb_unique_norm),
  max.conditioning.variables = 2, R = 100)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan
## WARNING: some observed variances are (at least) a factor 1000 times larger than
## others; use varTable(fit) to investigate

parameterEstimates(fit) %>%
  select(lhs, op, rhs, est, pvalue) %>%
  arrange(-abs(est)) %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Parameters for the approximate model sorted in falling effect size')
lhs | op | rhs | est | pvalue |
---|---|---|---|---|
occuptional…SSYK.2012. | ~~ | occuptional…SSYK.2012. | 515.4245404 | 0.0000000 |
occuptional…SSYK.2012. | ~ | salary | -28.6174604 | 0.0000000 |
occuptional…SSYK.2012. | ~ | perc_women_ssyk_region | -7.9191011 | 0.0000000 |
occuptional…SSYK.2012. | ~ | year_n | 5.1547157 | 0.0000000 |
occuptional…SSYK.2012. | ~ | perc_women_region | 2.9988570 | 0.0000000 |
occuptional…SSYK.2012. | ~ | sum_pop | 1.7031739 | 0.0000006 |
year_n | ~~ | year_n | 0.9997846 | NA |
perc_women_ssyk_region | ~~ | perc_women_ssyk_region | 0.9997846 | NA |
sum_pop | ~~ | sum_pop | 0.9770425 | 0.0000000 |
perc_women_region | ~~ | perc_women_region | 0.9603644 | 0.0000000 |
salary | ~~ | salary | 0.9556355 | 0.0000000 |
eduquotient | ~ | perc_women_region | -0.9015448 | 0.0000000 |
regioneduyears | ~ | sum_pop | 0.8082339 | 0.0000000 |
salaryquotient | ~~ | salaryquotient | 0.7302152 | 0.0000000 |
nremployees | ~~ | nremployees | 0.7295975 | 0.0000000 |
regioneduyears | ~ | eduquotient | -0.6494568 | 0.0000000 |
regioneduyears | ~~ | regioneduyears | 0.4814185 | 0.0000000 |
eduquotient | ~ | sum_pop | 0.4493408 | 0.0000000 |
salaryquotient | ~ | salary | -0.3776151 | 0.0000000 |
nremployees | ~ | sum_pop | 0.3740575 | 0.0000000 |
salaryquotient | ~ | eduquotient | -0.3242296 | 0.0000000 |
regioneduyears | ~ | year_n | 0.3050356 | 0.0000000 |
regioneduyears | ~ | perc_women_region | -0.2773857 | 0.0000000 |
nremployees | ~ | eduquotient | 0.2212280 | 0.0000000 |
salary | ~ | perc_women_ssyk_region | -0.1546639 | 0.0000000 |
salaryquotient | ~ | sum_pop | -0.1535919 | 0.0000000 |
perc_women_region | ~ | sum_pop | 0.1529078 | 0.0000000 |
sum_pop | ~ | salary | 0.1508210 | 0.0000000 |
salaryquotient | ~ | year_n | 0.1479586 | 0.0000000 |
nremployees | ~ | perc_women_ssyk_region | 0.1465144 | 0.0000000 |
salary | ~ | year_n | 0.1421677 | 0.0000000 |
nremployees | ~ | perc_women_region | 0.1329411 | 0.0002882 |
eduquotient | ~~ | eduquotient | 0.1082249 | 0.0000000 |
regioneduyears | ~ | salary | -0.0952185 | 0.0000000 |
nremployees | ~ | salary | -0.0951437 | 0.0000015 |
perc_women_region | ~ | perc_women_ssyk_region | -0.0790279 | 0.0000001 |
perc_women_region | ~ | year_n | -0.0767289 | 0.0000001 |
salaryquotient | ~ | perc_women_ssyk_region | 0.0693828 | 0.0000003 |
perc_women_region | ~ | salary | 0.0474522 | 0.0014239 |
eduquotient | ~ | year_n | 0.0283044 | 0.0000000 |
regioneduyears | ~~ | salaryquotient | -0.0201865 | 0.0204311 |
nremployees | ~~ | salaryquotient | -0.0040980 | 0.7020812 |
nremployees | ~ | occuptional…SSYK.2012. | 0.0037042 | 0.0000000 |
salaryquotient | ~ | occuptional…SSYK.2012. | -0.0032636 | 0.0000000 |
regioneduyears | ~ | occuptional…SSYK.2012. | -0.0029127 | 0.0000000 |
nremployees | ~~ | regioneduyears | -0.0017782 | 0.8380207 |
perc_women_ssyk_region | ~~ | year_n | -0.0005915 | NA |
plotLocalTestResults(r)
ggplot(data = tb_unique_norm, mapping = aes(x = eduquotient, y = salaryquotient)) +
  geom_boxplot(mapping = aes(group = cut_width(eduquotient, 0.4)))
ggplot(tb_unique_norm) + geom_point(aes(x = perc_women_region, y = eduquotient))
If you know the DAG, it is possible to identify the sets of covariates that allow unbiased estimation of causal effects with the function adjustmentSets in the dagitty package. I will calculate the possible adjustment sets for all models created in the earlier steps and compare them. Let's start by calculating the causal effect of the quotient between the average number of education years for men and women on the quotient between salary for men and women.
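As a minimal illustration of what adjustmentSets returns, here is a hypothetical three-variable DAG (not one of the learned models):

# hypothetical DAG: z confounds the effect of x on y
g_toy <- dagitty("dag { x -> y ; z -> x ; z -> y }")

# the minimal adjustment set for the total effect of x on y is { z },
# so the corresponding regression is y ~ x + z
adjustmentSets(g_toy, exposure = "x", outcome = "y")
## { z }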
causaleffect <- calceffect(table_of_dagitty_models, "eduquotient", "salaryquotient", tb_unique_norm)
causaleffect_df <- data.frame(map(causaleffect, unlist))
colnames(causaleffect_df) <- colnames(causaleffect)

dagitty_sem_and_causal_table <- dagitty_and_sem_table %>%
  left_join(causaleffect_df, by = c("1", "2", "3", "4", "5"))

dagitty_sem_and_causal_table %>%
  filter(lhs == "salaryquotient", rhs == "eduquotient") %>%
  ggplot() +
  geom_point(aes(x = estimate, y = pvalue.x, color = aic))
dagitty_sem_and_causal_table %>%
  mutate(modelnumber = as.integer(factor(`t(data.frame(eq))`))) %>%
  filter(lhs == "salaryquotient", rhs == "eduquotient") %>%
  ggplot() +
  geom_point(aes(x = modelnumber, y = estimate, color = pvalue.x))
dagitty_sem_and_causal_table %>%
  filter(lhs == "salaryquotient", rhs == "eduquotient") %>%
  mutate(modelnumber = as.integer(factor(`t(data.frame(eq))`))) %>%
  mutate(linear_equation_with_covariates = `t(data.frame(eq))`) %>%
  group_by(modelnumber) %>%
  mutate(frequency = n()) %>%
  arrange(modelnumber) %>%
  select(linear_equation_with_covariates, modelnumber, estimate, frequency) %>%
  unique() %>%
  knitr::kable(
    booktabs = TRUE,
    caption = 'Table showing the covariates needed for different models to calculate the effect')
linear_equation_with_covariates | modelnumber | estimate | frequency |
---|---|---|---|
salaryquotient ~ eduquotient | 1 | -0.3475676 | 13 |
salaryquotient ~ eduquotient + nremployees | 2 | -0.3438332 | 5 |
salaryquotient ~ eduquotient + nremployees + occuptional…SSYK.2012. | 3 | -0.3242904 | 1 |
salaryquotient ~ eduquotient + nremployees + occuptional…SSYK.2012. + perc_women_region + sum_pop + year_n | 4 | -0.4027556 | 31 |
salaryquotient ~ eduquotient + nremployees + occuptional…SSYK.2012. + salary + sum_pop + year_n | 5 | -0.3230239 | 7 |
salaryquotient ~ eduquotient + nremployees + perc_women_region + salary + sum_pop | 6 | -0.3621844 | 2 |
salaryquotient ~ eduquotient + nremployees + perc_women_region + sum_pop + year_n | 7 | -0.4259852 | 10 |
salaryquotient ~ eduquotient + nremployees + perc_women_ssyk_region + salary | 8 | -0.3335892 | 1 |
salaryquotient ~ eduquotient + occuptional…SSYK.2012. | 9 | -0.3355981 | 1 |
salaryquotient ~ eduquotient + occuptional…SSYK.2012. + perc_women_region + regioneduyears | 10 | -0.5848968 | 7 |
salaryquotient ~ eduquotient + occuptional…SSYK.2012. + perc_women_ssyk_region + salary + sum_pop + year_n | 11 | -0.3242342 | 31 |
salaryquotient ~ eduquotient + occuptional…SSYK.2012. + perc_women_ssyk_region + year_n | 12 | -0.3548155 | 2 |
salaryquotient ~ eduquotient + occuptional…SSYK.2012. + regioneduyears | 13 | -0.3497428 | 3 |
salaryquotient ~ eduquotient + occuptional…SSYK.2012. + regioneduyears + year_n | 14 | -0.3676810 | 1 |
salaryquotient ~ eduquotient + occuptional…SSYK.2012. + year_n | 15 | -0.3443816 | 3 |
salaryquotient ~ eduquotient + perc_women_region + perc_women_ssyk_region + sum_pop + year_n | 16 | -0.4063135 | 2 |
salaryquotient ~ eduquotient + perc_women_region + regioneduyears + sum_pop + year_n | 17 | -0.4235753 | 17 |
salaryquotient ~ eduquotient + perc_women_region + salary + sum_pop + year_n | 18 | -0.3932932 | 4 |
salaryquotient ~ eduquotient + perc_women_region + sum_pop | 19 | -0.3868834 | 11 |
salaryquotient ~ eduquotient + perc_women_region + sum_pop + year_n | 20 | -0.4106547 | 11 |
salaryquotient ~ eduquotient + perc_women_ssyk_region + salary + sum_pop | 21 | -0.2998376 | 3 |
salaryquotient ~ eduquotient + perc_women_ssyk_region + salary + sum_pop + year_n | 22 | -0.3135959 | 23 |
salaryquotient ~ eduquotient + perc_women_ssyk_region + sum_pop | 23 | -0.2934366 | 10 |
salaryquotient ~ eduquotient + perc_women_ssyk_region + year_n | 24 | -0.3671626 | 5 |
salaryquotient ~ eduquotient + regioneduyears | 25 | -0.3622676 | 2 |
salaryquotient ~ eduquotient + salary + sum_pop | 26 | -0.2923565 | 2 |
salaryquotient ~ eduquotient + salary + sum_pop + year_n | 27 | -0.3064032 | 4 |
salaryquotient ~ eduquotient + sum_pop | 28 | -0.2821068 | 4 |
salaryquotient ~ eduquotient + year_n | 29 | -0.3567006 | 5 |
ggplot(tb_unique_norm) + geom_point(aes(x = eduquotient, y = salaryquotient, color = perc_women_region))
As a second example, I will calculate the causal effect of the percentage of women in the region on the quotient between the average number of education years for men and women.
causaleffect <- calceffect(table_of_dagitty_models, "perc_women_region", "eduquotient", tb_unique_norm)
causaleffect_df <- data.frame(map(causaleffect, unlist))
colnames(causaleffect_df) <- colnames(causaleffect)

dagitty_sem_and_causal_table <- dagitty_and_sem_table %>%
  left_join(causaleffect_df, by = c("1", "2", "3", "4", "5"))

dagitty_sem_and_causal_table %>%
  filter(lhs == "eduquotient", rhs == "perc_women_region") %>%
  ggplot() +
  geom_point(aes(x = estimate, y = pvalue.x, color = aic))
dagitty_sem_and_causal_table %>%
  mutate(modelnumber = as.integer(factor(`t(data.frame(eq))`))) %>%
  filter(lhs == "eduquotient", rhs == "perc_women_region") %>%
  ggplot() +
  geom_point(aes(x = modelnumber, y = estimate, color = aic))