Simulating data with Bayesian networks
Bayesian networks are useful for many applications, and one of those is simulating new data. A Bayes net represents data as a probabilistic graph, and from this structure it is easy to simulate new data. This post demonstrates how to do this with bnlearn.
Fit a Bayesian network
Before simulating new data we need a model to simulate data from. Using the same Australian Institute of Sport dataset from my previous post on Bayesian networks we’ll set up a simple model. For convenience I’ll subset the data to 6 variables.
library(DAAG)
library(tidyverse)

# ais data from the DAAG package
ais_sub <- ais %>% dplyr::select(sex, sport, pcBfat, hg, rcc, hc)
The variables sex and sport are straightforward. The remaining four are:
- pcBfat – percent of body fat
- hg – hemoglobin concentration
- rcc – red cell count
- hc – hematocrit percentage
I’ve allowed the data to learn the structure of the network, bar one arc: sport to percent of body fat. The details are not shown here, but check out the post above on how to fit the structure algorithmically (I also suggest heading to the bnlearn documentation, which has great examples of a number of networks that can be downloaded). The structure is defined by the string below and converted to a bn class object.
library(visNetwork)
library(bnlearn)

# set structure
bn_struct <- model2network("[sex][sport|sex][hg|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc]")
bn_struct
##
##   Random/Generated Bayesian network
##
##   model:
##    [sex][hg|sex][sport|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc]
##   nodes:                                 6
##   arcs:                                  7
##     undirected arcs:                     0
##     directed arcs:                       7
##   average markov blanket size:           2.67
##   average neighbourhood size:            2.33
##   average branching factor:              1.17
##
##   generation algorithm:                  Empty
# plot network - code for function at end of post
plot.network(bn_struct)
Now that we have set the structure of the model, it is fit to the data with bn.fit using maximum likelihood estimation.
bn_mod <- bn.fit(bn_struct, data = ais_sub, method = "mle")
The output is quite detailed, so it’s worth running bn_mod to view the conditional probability tables and Gaussian distributions.
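It’s often easier to inspect one node at a time. For example, printing a discrete node shows its conditional probability table, and coef() extracts the parameters of a Gaussian node:

# conditional probability table for a discrete node
bn_mod$sport

# regression coefficients for a continuous (Gaussian) node
coef(bn_mod$hc)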
Simulate data
New data is simulated from a Bayes net by first sampling from each of the root nodes, in this case sex, and then from each child node conditional on its parent(s) (e.g. sport | sex and hg | sex) until data for all nodes has been drawn. The numbers on the nodes below indicate the sequence in which the data is simulated, noting that rcc is the terminal node.
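This topological ordering can be pulled straight from the network with bnlearn’s node.ordering function, which is a quick way to confirm the sequence (sex should come first and rcc last):

# topological ordering of the nodes - roots first, terminal nodes last
node.ordering(bn_struct)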
From this point it’s easy to simulate new data using rbn. Here we simulate a dataset the same size as the original, but you can simulate as many rows as needed.
ais_sim <- rbn(bn_mod, 202)
head(ais_sim)
##         hc       hg    pcBfat      rcc sex   sport
## 1 38.00754 12.43565 13.772499 3.917082   f    Swim
## 2 45.54250 15.79388 13.586402 4.824458   m   Field
## 3 49.87429 17.31542  5.308675 5.814398   m  T_400m
## 4 49.05707 17.19443  9.230973 5.337367   m  W_Polo
## 5 37.66307 12.99088 13.685909 4.020170   f     Gym
## 6 42.33715 14.62894 15.147165 4.440556   f Netball
Done. We now have a fully synthetic dataset which retains the properties of the original data. And it only took a few lines of code.
An important property of generating synthetic data is that the real data is not used to generate a record, meaning any predictors need to be simulated first (my post on the synthpop package explains this in more detail). This property is retained here since the data is generated sequentially, following the structure of the network. Also, when using synthpop the order in which the variables are simulated needs to be set, and since the order can alter the accuracy of the simulated dataset it’s important to spend the time to get it right. For a Bayesian network the order is determined by the structure, so in effect this step is already done.
Compare original and simulated datasets
The original and simulated datasets are compared in three ways: 1) observing the distributions of the variables, 2) comparing the output from various models, and 3) comparing conditional probability queries. The third test is more of a sanity check: if the data is generated from the original Bayes net, then a new one fit to the simulated data should be approximately the same. The more rows we generate, the closer the parameters will be to the original values.
The variable distributions are very close to the original, with only a small amount of variation, mostly observed in sport. Red cell count may have a slight bimodal distribution, but for the most part it’s a good fit. This amount of variation is reasonable since there are only 202 simulated observations. Simulating more rows would give a closer fit, but there are often practical reasons for retaining a dataset of the same size.
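(The distribution plots are produced by the comparison code in the “Code bits” section at the end of the post.) A quick numeric check along the same lines, as a minimal sketch rather than part of the original analysis:

# compare summary statistics of a continuous variable
summary(ais_sub$rcc)
summary(ais_sim$rcc)

# and the category proportions of a discrete variable
round(prop.table(table(ais_sub$sport)), 2)
round(prop.table(table(ais_sim$sport)), 2)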
For the second check, two linear models are fit to the original and simulated data to predict hematocrit levels from sex, hemoglobin concentration, percent of body fat and red cell count. Sport was left out of the model since it was not a strong predictor of hc and only increased the error.
# fit models
glm_og  <- glm(hc ~ hg + rcc + pcBfat + sex, data = ais_sub)
glm_sim <- glm(hc ~ hg + rcc + pcBfat + sex, data = ais_sim)

# print summary
summary(glm_og)
##
## Call:
## glm(formula = hc ~ hg + rcc + pcBfat + sex, data = ais_sub)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.9218  -0.4868   0.0523   0.5470   2.8983
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.15656    1.04109   4.953 1.57e-06 ***
## hg           1.64639    0.11520  14.291  < 2e-16 ***
## rcc          3.04366    0.31812   9.568  < 2e-16 ***
## pcBfat      -0.02271    0.01498  -1.517    0.131
## sexm        -0.20182    0.23103  -0.874    0.383
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.888242)
##
##     Null deviance: 2696.92  on 201  degrees of freedom
## Residual deviance:  174.98  on 197  degrees of freedom
## AIC: 556.25
##
## Number of Fisher Scoring iterations: 2
summary(glm_sim)
##
## Call:
## glm(formula = hc ~ hg + rcc + pcBfat + sex, data = ais_sim)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2117  -0.6232   0.0089   0.5910   2.0073
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.318057   0.953131   5.580  7.9e-08 ***
## hg           1.633765   0.107235  15.235  < 2e-16 ***
## rcc          2.905111   0.299397   9.703  < 2e-16 ***
## pcBfat      -0.001506   0.014768  -0.102    0.919
## sexm         0.319853   0.220904   1.448    0.149
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.7520993)
##
##     Null deviance: 3014.20  on 201  degrees of freedom
## Residual deviance:  148.16  on 197  degrees of freedom
## AIC: 522.64
##
## Number of Fisher Scoring iterations: 2
The coefficients and test statistics of the models are very similar, so both datasets lead to the same conclusions. Percent of body fat is the least accurate but still supports the same conclusion. In practice you should fit more models to assess the quality of the simulated data.
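A side-by-side view of the coefficients makes the comparison easier to see at a glance (a small sketch using the two models fit above):

# compare the coefficients of the original and simulated models
round(cbind(original = coef(glm_og), simulated = coef(glm_sim)), 3)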
As mentioned, the third check is more of a sanity check, but it is also a good demonstration of the process. By fitting the same structure to the simulated data we expect to estimate the same parameters and calculate very similar conditional probabilities. Here we simulate 20,000 observations to better estimate the parameters. The query is the conditional probability of an athlete’s red cell count given the sport they compete in, i.e. what is the probability that the athlete’s red cell count is greater than q, where q is the 33rd or 66th percentile?
library(progress)

# fit a Bayes net with the same structure using the simulated data
ais_sim <- rbn(bn_mod, 2e4)
bn_mod_sim <- bn.fit(bn_struct, ais_sim, method = "mle")

# mcpquery function at end of post - it generalises bnlearn::cpquery
# rcc is continuous so we calculate the probability rcc > some value,
# here the 33rd and 66th percentiles
# the function substitutes actual values for the tokens xx and yy and evaluates the text
given <- "sport == 'xx'"
event <- "rcc > yy"
vals <- as.matrix(expand.grid(sport = unique(ais_sub$sport), rcc = quantile(ais_sub$rcc, c(0.33, 0.66))))

orig <- mcpquery(bn_mod, values = vals, token = c("xx", "yy"), event, given, ns = 1e6) %>%
  spread(rcc, cp)
## P( rcc > yy | sport == 'xx' )
sim <- mcpquery(bn_mod_sim, values = vals, token = c("xx", "yy"), event, given, ns = 1e6) %>%
  spread(rcc, cp)
## P( rcc > yy | sport == 'xx' )
# join for display
left_join(orig, sim, by = "sport", suffix = c("_orig", "_sim"))
## # A tibble: 10 x 5
##    sport   `4.4533_orig` `4.9466_orig` `4.4533_sim` `4.9466_sim`
##    <chr>           <dbl>         <dbl>        <dbl>        <dbl>
##  1 B_Ball          0.687        0.310         0.682       0.302
##  2 Field           0.764        0.386         0.768       0.384
##  3 Gym             0.477        0.0658        0.475       0.0717
##  4 Netball         0.444        0.0584        0.442       0.0575
##  5 Row             0.652        0.270         0.654       0.271
##  6 Swim            0.752        0.370         0.758       0.376
##  7 T_400m          0.767        0.388         0.769       0.386
##  8 T_Sprnt         0.822        0.449         0.826       0.445
##  9 Tennis          0.640        0.251         0.639       0.249
## 10 W_Polo          0.945        0.571         0.944       0.566
The conditional probabilities from the simulated data are very close to the original, as expected. Now we can be confident that our simulated data can be used as an alternative to the original.
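For reference, a single one of these queries can be written directly with bnlearn::cpquery rather than through the mcpquery wrapper. The threshold of 5 here is just an illustrative value, not one used in the analysis above:

# P(rcc > 5 | sport == 'W_Polo'), estimated by logic sampling
cpquery(bn_mod, event = (rcc > 5), evidence = (sport == "W_Polo"), method = "ls", n = 1e6)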
Impute data
Another useful property of Bayes nets is that they can impute missing values, which is easily done using impute. We’ll remove 25% of the observations from the variables hg and hc, and allow the Bayes net to impute them.
ais_miss <- ais_sub
miss_id <- sample(1:nrow(ais_sub), 50)
ais_miss[miss_id, c("hg", "hc")] <- NA

# table counting the NA values in hg and hc
apply(ais_miss[, c("hg", "hc")], 2, function(x) table(is.na(x)))
##        hg  hc
## FALSE 152 152
## TRUE   50  50
The table confirms there are 50 missing observations in each of the hemoglobin and hematocrit variables. Now impute using Bayesian likelihood weighting.
ais_imp <- impute(bn_mod, data = ais_miss, method = "bayes-lw")
Plotting the imputed against the true values shows the Bayes net imputed the missing values quite well.
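The plot itself isn’t reproduced here, but a minimal version of it could look like the following (it assumes the tidyverse is loaded and reuses the miss_id indices from above):

# imputed vs true values for hemoglobin; points near the dashed line were imputed well
tibble(true = ais_sub$hg[miss_id], imputed = ais_imp$hg[miss_id]) %>%
  ggplot(aes(x = true, y = imputed)) +
  geom_point(colour = "darkturquoise") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "hg: imputed vs true values")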
I’ve only tested and shown two variables, but the others would perform similarly for the subset of data I’ve chosen. This data is normally distributed, so I expected the imputation to work well; if your data has more complex relationships you’ll need to be more rigorous in defining the structure.
Code bits
# plot network function
plot.network <- function(structure, ht = "400px", cols = "darkturquoise", labels = nodes(structure)){
  if(is.null(labels)) labels <- rep("", length(nodes(structure)))
  nodes <- data.frame(id = nodes(structure),
                      label = labels,
                      color = cols,
                      shadow = TRUE)
  edges <- data.frame(from = structure$arcs[, 1],
                      to = structure$arcs[, 2],
                      arrows = "to",
                      smooth = FALSE,
                      shadow = TRUE,
                      color = "black")
  return(visNetwork(nodes, edges, height = ht, width = "100%"))
}

# conditional probability query functions
mcpquery <- function(mod, values, token, event, given, ns = 1e4){
  cat("P(", event, "|", given, ")\n")
  UseMethod("mcpquery", values)
}

# numeric values
mcpquery.numeric <- function(mod, values, token, event, given, ns = 1e4){
  y <- rep(NA, length(values))
  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
    clear = FALSE, total = length(values))
  for(k in 1:length(values)){
    givenk <- gsub(token, values[k], given)
    eventk <- gsub(token, values[k], event)
    pb$tick(token = list(k = k, n = length(values), event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  return(tibble(values = values, cp = y) %>% arrange(desc(cp)))
}

# character values
mcpquery.character <- function(mod, values, token, event, given, ns = 1e4){
  y <- rep(NA, length(values))
  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
    clear = FALSE, total = length(values))
  for(k in 1:length(values)){
    givenk <- gsub(token, values[k], given)
    eventk <- gsub(token, values[k], event)
    pb$tick(token = list(k = k, n = length(values), event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  return(tibble(values = values, cp = y) %>% arrange(desc(cp)))
}

# matrix of values - substitutes one token per column
mcpquery.matrix <- function(mod, values, token, event, given, ns = 1e4){
  n <- nrow(values)
  y <- rep(NA, n)
  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )",
    clear = FALSE, total = n)
  for(k in 1:n){
    givenk <- given
    eventk <- event
    for(j in 1:ncol(values)){
      givenk <- gsub(token[j], values[k, j], givenk)
      eventk <- gsub(token[j], values[k, j], eventk)
    }
    pb$tick(token = list(k = k, n = n, event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  out <- as_tibble(values) %>% bind_cols(tibble(cp = y))
  colnames(out)[1:ncol(values)] <- colnames(values)
  return(out)
}

# compare the synthetic and original data frames
df <- ais_sub %>%
  mutate(type = "orig") %>%
  bind_rows(rbn(bn_mod, 202) %>% mutate(type = "sim"))

gg_list <- list()
grp_var <- "type"
vars <- colnames(df)[colnames(df) != grp_var]
for(k in 1:length(vars)){
  var_k <- vars[k]
  gg_list[[k]] <- ggplot(df, aes_string(x = var_k, fill = grp_var, col = grp_var))
  if(is.numeric(df[[var_k]])){
    gg_list[[k]] <- gg_list[[k]] + geom_density(alpha = 0.85, size = 0)
  }else{
    gg_list[[k]] <- gg_list[[k]] + geom_bar(position = "dodge")
  }
  gg_list[[k]] <- gg_list[[k]] +
    theme(axis.text.x = element_text(angle = 90), axis.title.x = element_blank()) +
    labs(title = var_k)
}