[This article was first published on binfalse » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
These days one frequently reads about wordclouds created with R, a trend sparked by the release of the wordcloud package by Ian Fellows on July 23rd. So here I am to put in my two cents.
I thought about creating a wordcloud of a complete blog history, so I built a script that connects to a MySQL database and grabs all published posts and pages. All articles are combined into one huge text which, once purged of tags and special chars, is visualized as a wordcloud:
    library(RMySQL)
    library(wordcloud)
    library(RColorBrewer)

    # special chars we want to delete (regex metacharacters are escaped)
    sent <- c(",", "\\.", ";", "=", ":", "\\?", "!", "-", "\\(", "\\)", "\\*", "&", "%", "\\$", "\\+", "\"", "'", "<", ">", "\\[", "\\]", "\\{", "\\}", "\\/", "\\\\")
    # wordpress bb-codes, also delete!
    bbcd <- c("\\[cc.+?/cci?\\]", "\\[latex.+?/latex\\]", "\\[caption.+?/caption\\]")
    # and of course delete HTML tags
    tags <- c("a", "b", "abbr", "strong", "em", "i", "p", "more", "td", "table", "tr", "th", "script", "h1", "h2", "h3", "h4", "h5", "h6", "div", "span", "small", "img")
    tags <- paste("</?", tags, "[^>]*>", sep="")
    # combine all purge-regexes; tags and bb-codes must go before the single chars
    repl <- c(tags, bbcd, sent)

    # connect to your DB
    con <- dbConnect(MySQL(), user="USER", password="PASSPHRASE", dbname="DB", host="HOST")
    # select all published articles
    res <- dbGetQuery(con, "SELECT post_content, post_title FROM wp_posts WHERE post_status='publish'")
    # combine them in a text
    text <- paste(as.matrix(res), collapse=" ")
    dbDisconnect(con)

    # replace all unwanted stuff, one pattern after the other
    for (r in repl) text <- gsub(r, " ", text)

    # here are our words:
    words <- table(strsplit(tolower(text), "\\s+")[[1]])
    # remove words containing multi-byte (non-ASCII) chars
    words <- words[nchar(names(words), "chars") == nchar(names(words), "bytes")]
    # remove words shorter than 4 chars
    words <- words[nchar(names(words), "chars") > 3]
    # remove words occurring less than 5 times
    words <- words[words > 4]

    # create the image
    png("/tmp/cloud.png", width=580, height=580)
    pal2 <- brewer.pal(8, "Set2")
    wordcloud(names(words), words, scale=c(9, .1), min.freq=3, max.words=Inf, random.order=FALSE, rot.per=.3, colors=pal2)
    dev.off()
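If you want to try the purge-and-count part without a database at hand, here is a minimal sketch in base R. The sample text and the reduced pattern lists are made up for illustration; only the overall approach (delete tags, then bb-codes, then single chars, then count) follows the script above.

```r
# Made-up stand-in for the text that would come from the database.
text <- "<p>Hello <strong>World</strong>, hello again!</p> [latex]x^2[/latex] R is fun; Wordcloud wordcloud WORDCLOUD?"

# A reduced, illustrative set of purge patterns (not the full lists).
tags <- paste("</?", c("p", "strong"), "[^>]*>", sep = "")
bbcd <- c("\\[latex.+?/latex\\]")
sent <- c(",", "!", ";", "\\?")

# Apply the patterns in order: tags first, then bb-codes, then single chars.
for (r in c(tags, bbcd, sent)) {
  text <- gsub(r, " ", text)
}

# Split into lower-case tokens and drop the empty token that
# leading whitespace leaves behind.
tokens <- strsplit(tolower(text), "\\s+")[[1]]
tokens <- tokens[tokens != ""]

# Count the tokens and keep only words longer than 3 chars.
words <- table(tokens)
words <- words[nchar(names(words), "chars") > 3]
print(sort(words, decreasing = TRUE))
```

The order of the patterns matters: if the single chars were removed first, the `<`, `>`, `[` and `]` would already be gone and the tag and bb-code patterns could not match anymore.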
Enough code, here is the result for my small blog:
Smart image, isn't it? Unfortunately, it takes about 30 secs to generate it; otherwise it would be cool to create such a cloud live with, for example, rApache.
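I did not measure which step eats most of these 30 secs. One way to find out would be to wrap the individual steps in `system.time`, roughly like the sketch below. The helper `time_step` and the synthetic text are assumptions for illustration; they are not part of the script.

```r
# Hypothetical helper: evaluate an expression and report the elapsed
# wall-clock time in seconds.
time_step <- function(label, expr) {
  elapsed <- system.time(expr)[["elapsed"]]
  cat(sprintf("%-8s %.3f s\n", label, elapsed))
  invisible(elapsed)
}

# Synthetic text standing in for the combined blog posts (made-up data).
text <- paste(rep("<p>Some <em>sample</em> text, with tags!</p>", 20000),
              collapse = " ")

# Time the purge step.
t_clean <- time_step("clean", {
  for (r in c("</?p[^>]*>", "</?em[^>]*>", ",", "!")) {
    text <- gsub(r, " ", text)
  }
})

# Time the counting step.
t_count <- time_step("count", {
  words <- table(strsplit(tolower(text), "\\s+")[[1]])
})
```

The same wrapper could be put around the `dbGetQuery` and `wordcloud` calls of the real script to see where the time goes.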
Download:
R: wordpress-wordcloud.R
(Please take a look at the man-page. Browse bugs and feature requests.)
To leave a comment for the author, please follow the link and comment on their blog: binfalse » R.