[This article was first published on Wiekvoet, and kindly contributed to R-bloggers.]
The Wheel of Time is a series of books started by Robert Jordan. Unfortunately, he died too early. Like all fans of the series, I feel very lucky that Brandon Sanderson was able to continue these books. The first book Sanderson wrote was The Gathering Storm; the last one is due in January 2013. This post examines how common words can be used to differentiate between books written by Sanderson and those written by Jordan.
Data
The training data are some of the books by Sanderson and Jordan. They form three categories: Robert Jordan Wheel of Time, Robert Jordan other, and Brandon Sanderson various.
- The Eye of the World (Wheel of Time) by Robert Jordan
- The Fires of Heaven (Wheel of Time) by Robert Jordan
- Elantris by Brandon Sanderson
- Warbreaker by Brandon Sanderson
- Prince of the Blood (other) by Robert Jordan
- Conan the Defender (other) by Robert Jordan
The test set consists of three books:
- Knife of Dreams (Wheel of Time) by Robert Jordan
- Mistborn by Brandon Sanderson
- The Gathering Storm (Wheel of Time) by Brandon Sanderson and Robert Jordan
All books were acquired via darknet and read into R as a vector with one element per chapter. Prologue and epilogue count as separate chapters. For each chapter, the relative frequency of common words is computed. Here, common words are defined as the stopwords from the tm package. For example:
tm::stopwords("English")[1:5]
[1] "a"      "about"  "above"  "across" "after"
Two functions were devised to count the relative occurrence of these words per chapter:
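# number of times the word 'what' occurs in the text 'where',
# delimited by blanks and optionally wrapped in punctuation
numwords <- function(what,where) {
  g1 <- gregexpr(paste('[[:blank:]]+[[:punct:]]*',what,'[[:punct:]]*[[:blank:]]+',sep=''),
                 where,perl=TRUE,ignore.case=TRUE)
  if (g1[[1]][1]==-1) 0L
  else length(g1[[1]])
}
# relative frequency of each stopword per chapter; rows are chapters
countwords <- function(book) {
  sw <- tm::stopwords("English")
  la <- lapply(book,function(where) {
    sa <- sapply(sw,function(what) numwords(what,where))
    # the number of runs of blanks approximates the number of words
    ntot <- length(gregexpr('[[:blank:]]+',
                            where,perl=TRUE,ignore.case=TRUE)[[1]])
    sa/ntot
  } )
  mla <- t(do.call(cbind,la))
}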
# words are counted
wtEotW <- countwords(tEotW)
wElantris <- countwords(Elantris)
wtFoH <- countwords(tFoH)
wWarbreaker <- countwords(Warbreaker)
wPotB <- countwords(PotB)
wConan <- countwords(Conan)
wtGS <- countwords(tGS)
wMistborn <- countwords(Mistborn)
wKoD <- countwords(KoD)
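To see what numwords actually counts, it can be tried on a few toy strings. The sketch below repeats the definition so that it runs on its own; the sentences are made up for illustration and are not taken from any of the books. It suggests two properties of the pattern: a word is only counted when it has a blank on both sides, and directly repeated words share a blank, so one of the repeats is missed.

```r
# numwords as defined above, repeated so the example is self-contained
numwords <- function(what, where) {
  g1 <- gregexpr(paste('[[:blank:]]+[[:punct:]]*', what,
                       '[[:punct:]]*[[:blank:]]+', sep = ''),
                 where, perl = TRUE, ignore.case = TRUE)
  if (g1[[1]][1] == -1) 0L
  else length(g1[[1]])
}

# made-up sentences, not text from the books
n_mid    <- numwords('the', ' He saw the cat and the dog. ')  # both occurrences found
n_start  <- numwords('the', 'The cat sat.')                   # no leading blank, not counted
n_repeat <- numwords('the', ' the the the ')                  # middle 'the' is skipped
n_quoted <- numwords('not', ' "Not, I think," she said. ')    # punctuation is tolerated

c(mid = n_mid, start = n_start, 'repeat' = n_repeat, quoted = n_quoted)
```

Every book is affected in the same way, so this probably matters little when comparing authors, but the values are approximate rather than exact word frequencies.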
Model
A random forest is used because the number of variables (stopwords) exceeds the number of objects (chapters).
# combine the counts and fit the model
library(randomForest)
all <- rbind(wElantris,wWarbreaker,wtEotW,wtFoH,wPotB,wConan)
# one category label per chapter, in the same order as the rows of 'all'
cats <- factor(c(
  rep('BS',nrow(wElantris)),
  rep('BS',nrow(wWarbreaker)),
  rep('WoT',nrow(wtEotW)),
  rep('WoT',nrow(wtFoH)),
  rep('RJ',nrow(wPotB)),
  rep('RJ',nrow(wConan))
),levels=c('BS','WoT','RJ'))
rf1 <- randomForest(y=cats,x=all,importance=TRUE)
rf1
Call:
randomForest(x = all, y = cats, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 22
OOB estimate of error rate: 3.93%
Confusion matrix:
     BS WoT RJ class.error
BS  124   0  1 0.008000000
WoT   0 110  1 0.009009009
RJ    5   4 35 0.204545455
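The error rates in this output follow directly from the confusion matrix. As a quick check, the sketch below retypes the matrix by hand (the object names are only for illustration) and recomputes the per-class and overall out-of-bag error. It also works out which numbers of variables are consistent with 22 variables tried per split, assuming the randomForest default for classification, floor(sqrt(p)), was in effect, as the call suggests.

```r
# confusion matrix copied from the output above; rows are the true category
conf <- matrix(c(124,   0,  1,
                   0, 110,  1,
                   5,   4, 35),
               nrow = 3, byrow = TRUE,
               dimnames = list(c('BS', 'WoT', 'RJ'), c('BS', 'WoT', 'RJ')))

class_error <- 1 - diag(conf) / rowSums(conf)   # per-class error
oob_error   <- 1 - sum(diag(conf)) / sum(conf)  # overall out-of-bag error
round(100 * oob_error, 2)

# values of p for which floor(sqrt(p)) equals 22 (assumes the default mtry)
p_range <- range(which(floor(sqrt(1:1000)) == 22))
p_range
```

Of the 280 training chapters, 11 are misclassified out of bag, and 9 of those come from Jordan's books outside the Wheel of Time, the smallest category with 44 chapters.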
varImpPlot(rf1)
Words that discriminate between the three categories include ‘and’, ‘didn’t’ and ‘not’. The next figure shows the typical usage of the nine most important words (by mean decrease in Gini). Note that the data have been scaled at this point to make the display easier to read.
library(lattice)
im <- importance(rf1)
# the nine words with the largest mean decrease in Gini
toshow <- rownames(im)[order(-im[,'MeanDecreaseGini'])][1:9]
tall <- as.data.frame(scale(all[,toshow]))
tall$chapters <- rownames(tall)
tall$cats <- cats
rownames(tall) <- 1:nrow(tall)
# long format: one row per chapter and word
propshow <- reshape(tall,direction='long',
  timevar='Word',
  v.names='ScaledScore',
  times=toshow,
  varying=list(toshow))
bwplot( cats ~ScaledScore | Word,data=propshow)
Based on this, it seems Sanderson uses contractions such as ‘didn’t’, which Jordan did not. Jordan used ‘not’, ‘and’ and ‘or’ more often. ‘However’ is very much Sanderson.
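The scaling and reshaping that feed the box plot can be tried on a small table. This is a minimal sketch with invented proportions for three chapters and two words; only the calls to scale and reshape mirror the code above, and the names toy, words, scaled and long are made up for the example.

```r
# invented relative frequencies for three chapters (not real data)
toy <- data.frame(and = c(0.030, 0.025, 0.040),
                  not = c(0.010, 0.004, 0.012))
words <- names(toy)

# centre each column to mean 0 and scale it to standard deviation 1
scaled <- as.data.frame(scale(toy))
scaled$cats <- factor(c('BS', 'WoT', 'RJ'))

# stack the word columns: one row per chapter and word
long <- reshape(scaled, direction = 'long',
                timevar = 'Word',
                v.names = 'ScaledScore',
                times = words,
                varying = list(words))
long
```

After scaling, every word has mean 0 and standard deviation 1 across chapters, so words with very different raw frequencies can share one axis in the panels.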
Predictions
For the predictions, I took the predicted proportion of trees voting for each category, as this shows some of the uncertainty in the categorization, which I find interesting. Density plots are used to display the predictions. Each panel in the plot shows the strength of the association between a book and a category: the higher the values, the stronger the association. Each row represents a book, each column a category.
ptGS <- predict(rf1,wtGS,type='prob')
pMistborn <- predict(rf1,wMistborn,type='prob')
pKoD <- predict(rf1,wKoD,type='prob')
preds <- as.data.frame(rbind(ptGS,pMistborn,pKoD))
preds$Book <- c(rep('the Gathering Storm',nrow(ptGS)),
  rep('Mistborn',nrow(pMistborn)),rep('Knife of Dreams',nrow(pKoD)))
predshow <- reshape(preds,direction='long',
  timevar='Prediction',v.names='Score',times=c('BS','WoT','RJ'),
  varying=list(w=c('BS','WoT','RJ')))
densityplot(~Score | Prediction + Book,data=predshow)
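To make the idea of a predicted proportion of trees concrete, here is a toy calculation with ten invented tree votes for a single chapter. It is only a sketch of the vote fractions that predict is understood to return with type='prob'; the votes and the object names are made up, and the real forest has 500 trees.

```r
# invented votes of ten trees for one chapter (not output of the model)
votes <- factor(c('BS', 'BS', 'WoT', 'BS', 'RJ', 'BS', 'BS', 'WoT', 'BS', 'BS'),
                levels = c('BS', 'WoT', 'RJ'))

# proportion of trees voting for each category; the proportions sum to 1
prop <- table(votes) / length(votes)
prop

# the category with the largest share is the predicted class
winner <- names(prop)[which.max(prop)]
winner
```

A chapter with shares of 0.7, 0.2 and 0.1 is called Sanderson, but less firmly than one with a share of 0.95; the density plots show how these shares are distributed over the chapters of each book.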
Interpretation
Knife of Dreams is correctly categorized as Wheel of Time, and Mistborn is correctly categorized as Sanderson. This shows the predictions perform well, so the item of interest can be examined: The Gathering Storm. It sits solidly in the Sanderson category. Interestingly, it sits a little less in Sanderson than Mistborn does, and a bit more in Wheel of Time.