Using wordcloud on search terms & phrases

[This article was first published on The Schmitt-R, and kindly contributed to R-bloggers.]
The wordcloud package for R is great, but all the examples I found used the tm package to process a large amount of textual data (web pages, text files, Google Docs, etc.).

But what if you have normalized data, where each row is already a word and its frequency? Or what if you want whole phrases in a wordcloud rather than single words? One example is the terms that users have entered into a web search.

In my case, I pull the data from a data source with PHP and write it to a CSV file in descending order of frequency.

The relevant part of the PHP script (after populating the array $terms):

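// $terms maps each search phrase to its frequency.
// $min_freq is assumed to be set earlier in the script.
$cwd = getcwd();
$local_path = $cwd.'/csv/';
$filename = $local_path.'searchterms.csv';
$fp = fopen($filename, 'w');
fputcsv($fp, array('term','freq')); // header row
arsort($terms); // sort by value, highest frequency first, keeping keys
$max_terms = 100;
$i = 0;
foreach ($terms as $q => $v) {
    $i++;
    if ($v > $min_freq) fputcsv($fp, array($q,$v));
    if ($i >= $max_terms) break; // stop after examining $max_terms entries
}
fclose($fp);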
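
For readers who would rather do this step in R, here is a rough base-R sketch of the same idea: sort by frequency, drop rare terms, keep the top entries, and write a CSV. The input vector, the `min_freq` value, and the temporary output path are all assumptions made for illustration; the original script does not show them.

```r
# Hypothetical counts standing in for the PHP $terms array
# (names are the search phrases, values are the frequencies).
terms <- c("walmart layaway" = 6502, "rare query" = 1,
           "target black friday" = 8239, "america idol" = 1777)

min_freq  <- 1    # assumed threshold, not taken from the original script
max_terms <- 100

terms <- sort(terms, decreasing = TRUE)  # highest frequency first
terms <- terms[terms > min_freq]         # drop terms at or below the threshold
terms <- head(terms, max_terms)          # keep at most max_terms entries

out <- data.frame(term = names(terms), freq = as.numeric(terms),
                  stringsAsFactors = FALSE)

# Written to a temporary directory here; the post uses csv/searchterms.csv.
csv_path <- file.path(tempdir(), "searchterms.csv")
write.csv(out, csv_path, row.names = FALSE)
```

By default `write.csv` quotes character fields, so multi-word phrases stay in a single column when the file is read back.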

Here is the sample data:

term,freq
"target black friday",8239
"walmart layaway",6502
"america idol",1777
"american idol episodes",1741
"mexican train domino game",1585
"jc penny outlet store",1159
"the chicago code",1130

The R script:

library(wordcloud)
library(RColorBrewer)
# Read phrases as character (not factor) and frequencies as numeric
datain <- read.csv("csv/searchterms.csv", colClasses=c("character", "numeric"))
pal2 <- brewer.pal(8, "Dark2")
png("wordcloud.png", width=1000, height=1000)
# Pass the terms and their frequencies directly; no tm preprocessing needed
wordcloud(datain$term, datain$freq, scale=c(8,.4), min.freq=1, max.words=Inf, random.order=FALSE, rot.per=.15, colors=pal2)
dev.off()

One consideration: if a search phrase is too long to fit on the page at its scaled size, wordcloud produces a warning and omits it from the resulting image, so you need to compensate by increasing the image dimensions (or lowering the scale values). It may be possible to scale the image dynamically based on the string length of the highest-frequency result.
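
The dynamic scaling idea could be sketched roughly as follows. This is only a heuristic under stated assumptions: the helper `estimate_png_size` and its `px_per_char` constant are hypothetical, not part of the wordcloud package, and the constant would need tuning for a given font and `scale` setting.

```r
# Hypothetical helper: estimate a square PNG size (in pixels) from the
# character length of the highest-frequency phrase.
# px_per_char is an assumed constant, not a value from wordcloud.
estimate_png_size <- function(terms, freqs, px_per_char = 60, min_size = 1000) {
  top_term <- terms[which.max(freqs)]      # phrase drawn at the largest size
  needed   <- nchar(top_term) * px_per_char
  max(min_size, needed)                    # never go below the default size
}

# A few rows of the sample data
datain <- data.frame(
  term = c("target black friday", "walmart layaway", "america idol"),
  freq = c(8239, 6502, 1777),
  stringsAsFactors = FALSE
)

size <- estimate_png_size(datain$term, datain$freq)
```

The resulting `size` can then be used as both `width` and `height` in the `png()` call. If wordcloud still warns that a phrase could not be fit, increase `px_per_char`.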

Here is the resulting wordcloud:

For more on R, visit http://www.r-bloggers.com/
