24 Days of R: Day 11
I don't know how often Michael Caine has appeared in a Shakespearean work, but I'm sure that he has and I'm sure that he was excellent. I'm a bit pressed for time today, so this is just a simple word cloud featuring the full text of King Lear. I found the text at a website that I presume is associated with a university in Cambridge (http://shakespeare.mit.edu/lear/full.html). I stored a local copy.
My sister lives in Stratford-upon-Avon and can't stop talking about Shakespeare. Today's post is dedicated to her.
aFile = readLines("./Data/Lear.txt")

library(tm)
myCorpus = Corpus(VectorSource(aFile))
myCorpus = tm_map(myCorpus, tolower)
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))

myDTM = TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
m = as.matrix(myDTM)
v = sort(rowSums(m), decreasing = TRUE)

library(wordcloud)
set.seed(1234)
wordcloud(names(v), v, min.freq = 15)
A lot of “king”, “lear”, “thee”, “thy” and “thou”.
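For a rough idea of what the frequency vector `v` holds before it reaches `wordcloud()`, here is a minimal base R sketch on a tiny sample. The helper name `word_freq`, the two sample lines and the three-word stop list are all assumptions made for illustration; this is not what `tm` does internally, and `tm`'s English stop list is much longer.

```r
# Hypothetical helper (not part of tm): count word frequencies in a
# character vector after lower-casing and stripping non-letters.
word_freq = function(lines, stop_words = character(0)) {
  text = tolower(paste(lines, collapse = " "))
  text = gsub("[^a-z ]", " ", text)        # drop punctuation and digits
  words = strsplit(text, " +")[[1]]
  words = words[nzchar(words)]             # remove empty strings
  words = words[!words %in% stop_words]    # remove the supplied stop words
  sort(table(words), decreasing = TRUE)
}

# Illustrative sample only; the real input would be aFile.
sample_lines = c("Nothing will come of nothing: speak again.",
                 "Nothing, my lord.")
freq = word_freq(sample_lines, stop_words = c("of", "my", "will"))
head(freq, 3)
```

The names and counts of `freq` play the same role as `names(v)` and `v` in the call to `wordcloud()`.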
And of course, in searching for a reference for the code above (I modified it from something else), I came across this: Text mining Shakespeare. I feel even lazier than I did before.
I can't leave it at that, so I'll very quickly determine the most frequent two- and three-word phrases in the text.
library(tau)
bigrams = textcnt(aFile, n = 2, method = "string")
bigrams = bigrams[order(bigrams, decreasing = TRUE)]
bigrams[1]
## king lear
##       209
bigrams[2]
## my lord
##      76

trigrams = textcnt(aFile, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
trigrams[1]
## king lear no
##           13
trigrams[2]
## i know not
##         12
No surprises that the most frequent bigram is “king lear” at 209 times and “my lord” is the sort of thing one would expect in an Elizabethan play. I like that the most frequent trigram is “king lear no” at 13. I'll have to have a look at the text to see what's behind that.
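To make the counting itself a little less of a black box, here is a minimal base R sketch of sliding-window n-gram counting. The helper name `ngram_counts` and the two sample lines are assumptions for illustration. The sketch joins all lines into one stream of words, so phrases can span line breaks; that is a choice made here and may not match how `textcnt()` treats its input.

```r
# Hypothetical helper (not part of tau): count n-word phrases in a
# character vector, treating the whole input as one stream of words.
ngram_counts = function(lines, n = 2) {
  text = tolower(paste(lines, collapse = " "))
  words = strsplit(gsub("[^a-z ]", " ", text), " +")[[1]]
  words = words[nzchar(words)]
  if (length(words) < n) return(table(character(0)))
  # Slide a window of n words along the text, one start position at a time.
  starts = seq_len(length(words) - n + 1)
  phrases = vapply(starts,
                   function(i) paste(words[i:(i + n - 1)], collapse = " "),
                   character(1))
  sort(table(phrases), decreasing = TRUE)
}

# Illustrative sample only; the real input would be aFile.
sample_lines = c("my lord, I know not", "I know not what, my lord")
bi = ngram_counts(sample_lines, n = 2)
tri = ngram_counts(sample_lines, n = 3)
bi["my lord"]
tri[1]
```

Running something like this on a handful of lines from the play is one way to see which neighbouring words produce a given phrase.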
sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] wordcloud_2.4      RColorBrewer_1.0-5 Rcpp_0.10.6
## [4] knitr_1.4.1        RWordPress_0.2-3   tau_0.0-15
## [7] tm_0.5-9.1
##
## loaded via a namespace (and not attached):
##  [1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    parallel_3.0.2
##  [5] RCurl_1.95-4.1 slam_0.1-30    stringr_0.6.2  tools_3.0.2
##  [9] XML_3.98-1.1   XMLRPC_0.3-0