Site icon R-bloggers

Simulating data with Bayesian networks

[This article was first published on R – Daniel Oehm | Gradient Descending, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Bayesian networks are really useful for many applications and one of those is to simulate new data. Bayes nets represent data as a probabilistic graph and from this structure it is then easy to simulate new data. This post will demonstrate how to do this with bnlearn.

Fit a Bayesian network

Before simulating new data we need a model to simulate data from. Using the same Australian Institute of Sport dataset from my previous post on Bayesian networks we’ll set up a simple model. For convenience I’ll subset the data to 6 variables.

library(DAAG)
library(tidyverse)

# ais data from the DAAG package
ais_sub <- ais %>% dplyr::select(sex, sport, pcBfat, hg, rcc, hc)

The variables sex and sport are pretty straight forward. The remaining four are

I’ve allowed the data to learn the structure of the network, bar one arc, sport to percentage of body fat. The details are not shown here, but check out the post above on how to fit the structure algorithmically (also I suggest heading to the bnlearn doco which has great examples of a number of networks that can be downloaded). The structure is defined by the string and converted to a bn class object.

library(visNetwork)
library(bnlearn)

# set structure
bn_struct <- model2network("[sex][sport|sex][hg|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc]")
bn_struct

## 
##   Random/Generated Bayesian network
## 
##   model:
##    [sex][hg|sex][sport|sex][pcBfat|sex:sport][hc|hg:pcBfat][rcc|hc] 
##   nodes:                                 6 
##   arcs:                                  7 
##     undirected arcs:                     0 
##     directed arcs:                       7 
##   average markov blanket size:           2.67 
##   average neighbourhood size:            2.33 
##   average branching factor:              1.17 
## 
##   generation algorithm:                  Empty

# plot network - code for function at end of post
plot.network(bn_struct)

Now that we have set the structure of the model it is fit to the data with bn.fit using maximum likelihood estimation.

bn_mod <- bn.fit(bn_struct, data = ais_sub, method = "mle")

The output is quite detailed so it’s worth running bn_mod to view the conditional probability tables and Gaussian distributions.

Simulate data

New data is simulated from a Bayes net by first sampling from each of the root nodes, in this case sex. Then followed by the children conditional on their parent(s) (e.g. sport | sex and hg | sex) until data for all nodes has been drawn. The numbers on the nodes below indicate the sequence in which the data is simulated, noting that rcc is the terminal node.

From this point it’s easy to simulate new data using rbn. Here we simulate a dataset the same size as the original, but you can simulate as many rows as needed.

ais_sim <- rbn(bn_mod, 202)
head(ais_sim)

##         hc       hg    pcBfat      rcc sex   sport
## 1 38.00754 12.43565 13.772499 3.917082   f    Swim
## 2 45.54250 15.79388 13.586402 4.824458   m   Field
## 3 49.87429 17.31542  5.308675 5.814398   m  T_400m
## 4 49.05707 17.19443  9.230973 5.337367   m  W_Polo
## 5 37.66307 12.99088 13.685909 4.020170   f     Gym
## 6 42.33715 14.62894 15.147165 4.440556   f Netball

Done. We now have a fully synthetic dataset which retains the properties of the original data. And it only took a few lines of code.

An important property of generating synthetic data is that it doesn’t use real data to do so, meaning any predictors need to be simulated first (my post on the synthpop package explains this in more detail). This property is retained since the data is generated sequentially as per the structure of network. Also, when using synthpop the order in which the variables are simulated needs to be set. The order can alter the accuracy of simulated dataset and so it’s important to spend the time to get it right. For a Bayesian network the order is determined by the structure, so in effect this step is already done.

Compare original and simulated datasets

The original and simulated datasets are compared in a couple of ways 1) observing the distributions of the variables 2) comparing the output from various models and 3) comparing conditional probability queries. The third test is more of a sanity check. If the data is generated from the original Bayes net then a new one fit on the simulated data should be approximately the same. The more rows we generate the closer the parameters will be to the original values.

The variable distributions are very close to the original with only a small amount of variation, mostly observed in sport. Red cell count may have a slight bi-modal distribution but in most part it’s a good fit. This amount of variation is reasonable since there are only 202 simulated observations. Simulating more rows will be a closer fit but there are often practical considerations for retaining the same size dataset.

For the second check, two linear models are fit to the original and simulated data to predict hematocrit levels with sex, hemoglobin concentration, percentage of body fat and red cell count as predictors. Sport was left out of the model since it was not a strong predictor of hc and only increased the error.

# fit models
glm_og <- glm(hc ~ hg + rcc + pcBfat + sex, data = ais_sub)
glm_sim <- glm(hc ~ hg + rcc + pcBfat + sex, data = ais_sim)

# print summary
summary(glm_og)

## 
## Call:
## glm(formula = hc ~ hg + rcc + pcBfat + sex, data = ais_sub)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9218  -0.4868   0.0523   0.5470   2.8983  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.15656    1.04109   4.953 1.57e-06 ***
## hg           1.64639    0.11520  14.291  < 2e-16 ***
## rcc          3.04366    0.31812   9.568  < 2e-16 ***
## pcBfat      -0.02271    0.01498  -1.517    0.131    
## sexm        -0.20182    0.23103  -0.874    0.383    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.888242)
## 
##     Null deviance: 2696.92  on 201  degrees of freedom
## Residual deviance:  174.98  on 197  degrees of freedom
## AIC: 556.25
## 
## Number of Fisher Scoring iterations: 2

summary(glm_sim)

## 
## Call:
## glm(formula = hc ~ hg + rcc + pcBfat + sex, data = ais_sim)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2117  -0.6232   0.0089   0.5910   2.0073  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.318057   0.953131   5.580  7.9e-08 ***
## hg           1.633765   0.107235  15.235  < 2e-16 ***
## rcc          2.905111   0.299397   9.703  < 2e-16 ***
## pcBfat      -0.001506   0.014768  -0.102    0.919    
## sexm         0.319853   0.220904   1.448    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.7520993)
## 
##     Null deviance: 3014.20  on 201  degrees of freedom
## Residual deviance:  148.16  on 197  degrees of freedom
## AIC: 522.64
## 
## Number of Fisher Scoring iterations: 2

The coefficients and test statistics of the models are very similar, so both datasets result in the same conclusions. Percent of body fat is the least accurate but still make the same conclusion. In practice you should fit more models to assess the quality of the simulated data.

As mentioned the third is more of a sanity check but it is also a good demonstration of the process. By fitting the same structure to the simulated data we expect to estimate the same parameters and calculate very similar conditional probabilities. Here we simulate 20 000 observations to better estimate the parameters. The conditional probability for the athletes red cell count given the sport they compete in i.e. what is the probability the athletes red cell count will be greater than where is the 33rd and 66th percentile?

library(progress) 

# fit bayes net with the same structure using simulated
ais_sim <- rbn(bn_mod, 2e4)
bn_mod_sim <- bn.fit(bn_struct, ais_sim, method = "mle")

# sport - mcpquery function at end of post - it generalises bnlearn::cpquery
# rcc is continuous so we're calculating the probability rcc > some value, here the 33rd and 66th percentile 
# the function replaces xx and yy with actually values and evaluates the text
given <- "sport == 'xx'"
event <- "rcc > yy"
vals <- as.matrix(expand.grid(sport = unique(ais_sub$sport), rcc = quantile(ais_sub$rcc, c(0.33, 0.66))))
orig <- mcpquery(bn_mod, values = vals, token = c("xx", "yy"), event, given, n = 1e6) %>% spread(rcc, cp)

## P( rcc > yy | sport == 'xx' )

sim <- mcpquery(bn_mod_sim, values = vals, token = c("xx", "yy"), event, given, n = 1e6) %>% spread(rcc, cp)

## P( rcc > yy | sport == 'xx' )

# join for display
left_join(orig, sim, by = "sport", suffix = c("_orig", "_sim"))

## # A tibble: 10 x 5
##    sport   `4.4533_orig` `4.9466_orig` `4.4533_sim` `4.9466_sim`
##    <chr>           <dbl>         <dbl>        <dbl>        <dbl>
##  1 B_Ball          0.687        0.310         0.682       0.302 
##  2 Field           0.764        0.386         0.768       0.384 
##  3 Gym             0.477        0.0658        0.475       0.0717
##  4 Netball         0.444        0.0584        0.442       0.0575
##  5 Row             0.652        0.270         0.654       0.271 
##  6 Swim            0.752        0.370         0.758       0.376 
##  7 T_400m          0.767        0.388         0.769       0.386 
##  8 T_Sprnt         0.822        0.449         0.826       0.445 
##  9 Tennis          0.640        0.251         0.639       0.249 
## 10 W_Polo          0.945        0.571         0.944       0.566

The conditional probabilities from the simulated data are very close to the original as expected. Now we can be confident that our simulated data can be used as an alternative to the original.

Impute data

Another useful property of Bayes nets is to impute missing values. This is easily done using impute. We’ll remove 25% of the observations from variables hg and hc, and allow the Bayes net to impute them.

ais_miss <- ais_sub
miss_id <- sample(1:nrow(ais_sub), 50)
ais_miss[miss_id, c("hg", "hc")] <- NA

# table counting the NA values in hg and hc
apply(ais_miss[,c("hg", "hc")], 2, function(x) table(is.na(x)))

##        hg  hc
## FALSE 152 152
## TRUE   50  50

The table confirms there are 50 missing observations from hemoglobin and hematocrit variables. Now impute using Bayesian likelihood weighting.

ais_imp <- impute(bn_mod, data = ais_miss, method = "bayes-lw")

Plotting the imputed against the true values shows the Bayes net imputed the missing values quite well.

I’ve only tested and shown two variables but the others would perform similarly for the subset of data I have chosen. This data is normally distributed so I expected it to work well, however if your data has more complex relationships you’ll need to be more rigorous with defining the structure.

Code bits

# plot network function
plot.network <- function(structure, ht = "400px", cols = "darkturquoise", labels = nodes(structure)){
  if(is.null(labels)) labels <- rep("", length(nodes(structure)))
  nodes <- data.frame(id = nodes(structure),
                      label = labels,
                      color = cols,
                      shadow = TRUE
                      )

  edges <- data.frame(from = structure$arcs[,1],
                      to = structure$arcs[,2],
                      arrows = "to",
                      smooth = FALSE,
                      shadow = TRUE,
                      color = "black")

  return(visNetwork(nodes, edges, height = ht, width = "100%"))
}

# conditional probability query functions
mcpquery <- function(mod, values, token, event, given, ns = 1e4){
  cat("P(", event, "|", given, ")\n")
  UseMethod("mcpquery", values)
}

# numeric
mcpquery.numeric <- function(mod, values, token, event, given, ns = 1e4){
  y <- rep(NA, length(values))
  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )", 
    clear = FALSE, total = length(values))
  for(k in 1:length(values)){
    givenk <- gsub(token, values[k], given)
    eventk <- gsub(token, values[k], event)
    pb$tick(token = list(k = k, n = length(values), event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  return(tibble(values = values, cp = y) %>% arrange(desc(cp)))
}

# character
mcpquery.character <- function(mod, values, token, event, given, ns = 1e4){
  y <- rep(NA, length(values))
  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )", 
    clear = FALSE, total = length(values))
  for(k in 1:length(values)){
    givenk <- gsub(token, values[k], given)
    eventk <- gsub(token, values[k], event)
    pb$tick(token = list(k = k, n = length(values), event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  return(tibble(values = values, cp = y) %>% arrange(desc(cp)))
}

# matrix
mcpquery.matrix <- function(mod, values, token, event, given, ns = 1e4){

  n <- nrow(values)
  y <- rep(NA, n)

  pb <- progress_bar$new(
    format = "time :elapsedfull // eta :eta // :k of :n // P( :event | :given )", 
    clear = FALSE, total = n)

  for(k in 1:n){
    givenk <- given
    eventk <- event
    for(j in 1:ncol(values)){
      givenk <- gsub(token[j], values[k,j], givenk)
      eventk <- gsub(token[j], values[k,j], eventk)
    }
    pb$tick(token = list(k = k, n = n, event = eventk, given = givenk))
    y[k] <- eval(parse(text = paste0("cpquery(mod,", eventk, ",", givenk, ", n = ", ns, ", method = 'ls')")))
  }
  out <- as.tibble(values) %>%
    bind_cols(tibble(cp = y))
  colnames(out)[1:ncol(values)] <- colnames(values)
  return(out)
}


# compare the synthetic and original data frames
df <- ais_sub %>% 
  mutate(type = "orig") %>% 
  bind_rows(
    rbn(bn_mod, 202) %>% 
      mutate(type = "sim")
    ) # %>% 
gg_list <- list()
grp_var <- "type"
vars <- colnames(df)[colnames(df) != grp_var]
for(k in 1:length(vars)){
  var_k <- vars[k]
  gg_list[[k]] <- ggplot(df, aes_string(x = var_k, fill = grp_var, col = grp_var))
  if(is.numeric(df[[var_k]])){
    gg_list[[k]] <- gg_list[[k]] + geom_density(alpha = 0.85, size = 0)
  }else{
    gg_list[[k]] <- gg_list[[k]] + geom_bar(position = "dodge")
  }
  gg_list[[k]] <- gg_list[[k]] + 
    theme(
      axis.text.x = element_text(angle = 90),
      axis.title.x = element_blank()
    ) +
    labs(title = var_k)
}

The post Simulating data with Bayesian networks appeared first on Daniel Oehm | Gradient Descending.

To leave a comment for the author, please follow the link and comment on their blog: R – Daniel Oehm | Gradient Descending.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.