[This article was first published on Wiekvoet, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week I tried pcalg. This week deal (Learning Bayesian Networks with Mixed Variables). The aim n this post I want to try something new, a causal graphical model. The aim here is just as much to get myself a feel what these things do as to understand how the stone flakes data fit together. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Data
Data are stone flakes data which I analyzed previously. The first post was clustering, second linking to hominid type, third regression. Together these made for the bulk of a standard analysis. In this new analysis the same starting data is used.r2 <- read.table(‘StoneFlakes.txt’,header=TRUE,na.strings=’?’)
r1 <- read.table(‘annotation.txt’,header=TRUE,na.strings=’?’)
r12 <- merge(r1,r2)
r12$group <- factor(r12$group,labels=c(‘Lower Paleolithic’,
‘Levallois technique’,
‘Middle Paleolithic’,
‘Homo Sapiens’))
r12$site <- factor(c(‘other’,’gravel pit’)[r12$site+1])
r12$mat <- factor(c(‘flint’,’other’)[r12$mat])
r12$lmage <- log10(-r12$age)
Deal
The starting point of this post was to continue/ my analysis of last week. But when I discovered deal could be used to discover the model, repeating last week’s analysis was chosen instead. Deal does not have a Vignette, but there is a paperdeal: A Package for Learning Bayesian Networks which helped me very well to get started.
First Model
Initially I wanted to start with a model containing only continuous variables, similar to before, but that threw an error in jointprior(). Hence I added groups as factor. Autosearch() and heuristicsearch() give a lot of output, basically one line for each step. For brievety these are not shown. The good thing about this model is that it has a solution where ‘group’ is driving other variables.library(deal)
rfin <- subset(r12,,c(names(r2)[-1],’group’))
rfin <- rfin[complete.cases(rfin),]
rfin.nw <- network(rfin)
rfin.prior <- jointprior(rfin.nw)
Imaginary sample size: 8
rfin.nw <- learn(rfin.nw,rfin,rfin.prior)$nw
rfin.search <- autosearch(rfin.nw,
rfin,
rfin.prior,
trace=FALSE)
plot(rfin.search$nw)
Heuristic is used to further improve the model. In the end the model seems a bit more complex that pcalg, but not unreasonably so.
rfin.heuristic <- heuristic(rfin.search$nw,
rfin,
rfin.prior,
restart=10,
trace=FALSE,
trylist=rfin.search$trylist)
plot(rfin.heuristic$nw)
Second model
For brevity I won’t be repeating the code. It is all the same except for the data going in, which will be shown. The second model is similar to the first, but the (potential) outliers have been removed. This model looks even more clean.rfin <- subset(r12,
!(r12$ID %in% c(‘ms’,’c’,’roe’,’sz’,’va’,’arn’)),
c(names(r2)[-1],’group’))
rfin <- rfin[complete.cases(rfin),]
Third model
Including all sensible factors is the model I wanted to do. However, it seemed the imaginary sample size grew to 96. It is my experience that higher imaginary sample sizes produce more complex networks and longer run times. The current model seems a bit too complex to my liking. Moving under the recommended number gave runtime errors.Final model
In the final model (a few are skipped now) a number of simplifications were made. Region is removed as factor, log(-age) as continuous variable. Group has lost Homo Sapiens, since that category had only three records. The model is restricted in the sense that group cannot be the result of other variables. In the plot these are shown as red arrows.rfin <- subset(r12,
!(r12$ID %in% c(‘ms’,’c’,’roe’,’sz’,’va’,’arn’)),
c(-ID,-number,-age,-dating,-region,-lmage))
rfin <- rfin[rfin$group !=’Homo Sapiens’,]
rfin <- rfin[complete.cases(rfin),]
rfin.nw <- network(rfin)
rfin.prior <- jointprior(rfin.nw)
mybanlist <- matrix(
c(2:11,
rep(1,10)),ncol=2)
banlist(rfin.nw) <- mybanlist
Conclusion
Deal makes too complex networks for my liking, pcalg cannot use discrete variables. Deal has banned links, a feature which helps. pcalg made more nice plots, but I have the feeling that is relatively easily remedied. Neither gave a model which struck me as a model to continue with. I’ll be needing quite some more study to feel comfortable with this kind of models.To leave a comment for the author, please follow the link and comment on their blog: Wiekvoet.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.