Estimating mixed graphical models
[This article was first published on Jonas Haslbeck - r, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Determining conditional independence relationships through undirected graphical models is a key component in the statistical analysis of complex obervational data in a wide variety of disciplines. In many situations one seeks to estimate the underlying graphical model of a dataset that includes variables of different domains.
As an example, take a typical dataset in the social, behavioral and medical sciences, where one is interested in interactions, for example between gender or country (categorical), frequencies of behaviors or experiences (count) and the dose of a drug (continuous). Other examples are Internet-scale marketing data or high-throughput sequencing data.
There are methods available to estimate mixed graphical models from mixed continuous data, however, these usually have two drawbacks: first, there is a possible information loss due to necessary transformations and second, they cannot incorporate (nominal) categorical variables (for an overview see here). A new method implemented in the R-package mgm addresses these limitations.
In the following, we use the mgm-package to estimate the conditional independence network in a dataset of questionnaire responses of individuals diagnosed with Autism Spectrum Disoder. This dataset includes variables of different domains, such as age (continuous), type of housing (categorical) and number of treatments (count).
The dataset consists of responses of 3521 individuals to a questionnaire including 28 variables of domains continuous, count and categorical.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
dim(data)
## [1] 3521 28
round(data[1:4, 1:5],2)
## sex IQ agediagnosis opennessdiagwp successself
## [1,] 1 6 0 1 1.92
## [2,] 2 6 7 1 5.40
## [3,] 1 5 4 2 5.66
## [4,] 1 6 8 1 8.00
type <- c("c", "g", "g", "c", "c", "g", "c", "c", "p", "p",
"p", "p", "p", "p", "c", "p", "c", "g", "p", "p",
"p", "p", "g", "g", "g", "g", "g", "g", "c", "c",
"g")
cat <- c(2, 1, 1, 3, 2, 1, 5, 3, 1, 1, 1, 1, 1, 1, 2, 1, 4,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 1)
library(mgm)
fit <- mgmfit(data, type, cat, lamda.sel="EBIC", d=2)
# group variables
group_list <- list("Demographics"=c(1,14,15,28),
"Psychological"=c(2,4,5,6,18,20,21),
"Social environment" = c(7,16,17,19,26,27),
"Medical"=c(3,8,9,10,11,12,13,22,23,24,25))
# define nice colors
group_cols <- c("#E35959","#8FC45A","#4B71B3","#E8ED61")
# plot
library(qgraph)
qgraph(fit$adj,
vsize=3, layout="spring",
edge.color = rgb(33,33,33,100,
maxColorValue = 255),
color=group_cols,
border.width=1.5,
border.color="black",
groups=group_list,
nodeNames=datalist$colnames,
legend=TRUE,
legend.mode="groups",
legend.cex=.75)
To leave a comment for the author, please follow the link and comment on their blog: Jonas Haslbeck - r.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.