[This article was first published on Maximize Productivity with Industrial Engineer and Operations Research Tools, and kindly contributed to R-bloggers].
I have been using R a lot lately in my work. R is an open source statistical computing platform, but saying that R is only used for statistics does not do it justice. I am finding it to be a powerful platform for both statistical and optimization computing, and I have yet to find a task it cannot handle. Lately I’ve been curious about how my blog performs compared to other blogs, so I thought I would use this opportunity to show how I measured that in R. I want to rank Operations Research blogs using the Alexa ranking system. Unfortunately, Alexa does not have a search function for Operations Research blogs, so I have to gather the information myself using R.
This R tutorial uses the XML package. Packages extend R with functionality that the base platform does not provide on its own, and many different packages can be loaded to handle a wide variety of problems.
The first step is to load the XML package into the current R workspace. If you do not have the XML package installed on your computer, you will have to install it from the CRAN repositories.
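A minimal sketch of that install-then-load step, assuming a CRAN mirror is reachable from your machine. The helper name `ensure_package` is hypothetical and is not part of R or of the XML package.

```r
# Hypothetical helper: install a package from CRAN only when it is missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  # TRUE when the package can now be loaded, FALSE otherwise.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Usage for this tutorial (commented out so nothing is installed by accident):
# ensure_package("XML")
# library(XML)
```

Checking with requireNamespace first avoids reinstalling the package every time the script runs.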
Once the XML package is loaded, the actual programming begins. I need to save the URL of the Alexa information into the workspace, along with the list of blogs to look up. Once I have those variables, I can move on to using the XML package to gather the information.
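As a small illustration of setting up those variables, here is a sketch that builds the full Alexa URLs for a shortened blog list. The base address is the one used in this post; the variable names `blogs` and `urls` are only for this example.

```r
# Base address of the Alexa site-info page (taken from the code in this post).
urlbeg <- "http://www.alexa.com/siteinfo/"

# A shortened, illustrative subset of the blog list.
blogs <- c("bit-player.org", "thinkor.org")

# paste0() is vectorized, so one call builds every full URL.
urls <- paste0(urlbeg, blogs)
print(urls)
```

The full script builds the same strings one at a time inside its loop with paste and sep="".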
The main functions used from the XML package are htmlTreeParse, getNodeSet, and readHTMLTable. htmlTreeParse fetches the HTML from the URL and stores it as a parsed XML document. getNodeSet is a retrieval function that grabs only the nodes you specify. In this instance it looks for a table node with an id value equal to siteStats inside a div node. readHTMLTable takes the siteStats node and creates a table of data values.
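To see the three functions working together without hitting the network, here is a sketch that parses a small HTML string. The markup is made up for illustration and is only assumed to resemble the Alexa page; the real page layout may differ.

```r
library(XML)

# Made-up HTML standing in for an Alexa page; the real markup may differ.
html <- paste0(
  "<html><body><div>",
  "<table id='siteStats'>",
  "<tr><td>Traffic Rank</td><td>1,444,318</td></tr>",
  "<tr><td>Sites Linking In</td><td>120</td></tr>",
  "</table>",
  "</div></body></html>"
)

# asText = TRUE parses the string itself instead of treating it as a URL or file.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# XPath: any table with id 'siteStats' that is a direct child of a div.
nset <- getNodeSet(doc, "//div/table[@id='siteStats']")

# Convert the matched node into a data frame of character columns.
stats <- readHTMLTable(nset[[1]], header = FALSE, stringsAsFactors = FALSE)
print(stats)
```

getNodeSet returns a list of matching nodes, which is why the full script applies readHTMLTable to each element with lapply.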
While gathering the Alexa information with XML, I also have to format the data into a readable structure. This requires tabulating and text string manipulation. Notice the use of the functions readHTMLTable, strsplit, and gsub to format the data. All of this happens in a for loop that performs the XML parsing and text formatting one URL at a time. I’ve also created a data frame to hold all of the relevant information in a readable table.
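The text cleanup can be tried on its own with base R. The sketch below uses made-up cell contents and a hypothetical helper named `parse_rank`; it mirrors the strsplit and gsub steps, but it is not the exact code used in this post.

```r
# Hypothetical helper: turn a raw table cell into a numeric rank.
parse_rank <- function(cell) {
  first_line <- strsplit(cell, "\n")[[1]][1]  # keep text before the first line break
  digits <- gsub("[ ,]", "", first_line)       # drop spaces and thousands separators
  suppressWarnings(as.numeric(digits))         # non-numeric text becomes NA
}

# Made-up cell contents: one with a rank, one without.
raw <- c("bit-player.org" = " 1,444,318\nGlobal Rank",
         "example.org" = "No Data")

# Collect one row per blog in a data frame.
ranks <- data.frame(
  ORblog = names(raw),
  AlexaRank = vapply(raw, parse_rank, numeric(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(ranks)
```

Sites without a numeric rank end up as NA, which is how unranked blogs appear in the final data frame.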
The following is the R code.
library(XML)

# Base URL of the Alexa site-info page; the blog domain is appended to it.
urlbeg <- "http://www.alexa.com/siteinfo/"

# Operations Research blogs to rank.
urllist <- c(
  "industrialengineertools.blogspot.com",
  "punkrockor.wordpress.com",
  "thinkor.org",
  "john-poppelaars.blogspot.com",
  "bit-player.org",
  "opsres.wordpress.com",
  "orbythebeach.wordpress.com",
  "spokutta.wordpress.com",
  "engineered.typepad.com",
  "bernoulli-on-business.blogspot.com",
  "greenor.wordpress.com",
  "fmwaves.kproductivity.com",
  "blog.intechne.com",
  "jimorlin.wordpress.com",
  "jtonedm.com",
  "mswd.wordpress.com",
  "www.hakank.org",
  "optandor.com",
  "stochastix.wordpress.com",
  "restart2.blogspot.com",
  "scottaaronson.com",
  "ateji.blogspot.com",
  "geomblog.blogspot.com",
  "ormsblog.com",
  "wehart.blogspot.com",
  "yetanothermathprogrammingconsultant.blogspot.com",
  "annanagurney.blogspot.com",
  "healthyalgorithms.wordpress.com",
  "iaoreditor.blogspot.com",
  "openresearch.wordpress.com",
  "ornotes.blogspot.com",
  "reflectionsonor.wordpress.com",
  "arandomforest.com",
  "analytics-magazine.com",
  "hsimonis.wordpress.com",
  "cpstandard.wordpress.com",
  "blog.athico.com",
  "dualnoise.blogspot.com",
  "geneticargonaut.blogspot.com",
  "john-raffensperger.blogspot.com",
  "orforum.blog.informs.org",
  "orinanobworld.blogspot.com",
  "www.or-exchange.com",
  "pomsblog.wordpress.com",
  "research-reflections.blogspot.com",
  "www.scienceofbetter.org",
  "operationsroom.wordpress.com"
)

# Accumulates one row (blog, rank) per URL.
ORrank <- data.frame()

for (i in seq_along(urllist)) {
  url <- paste(urlbeg, urllist[i], sep = "")
  # Parse the page into an internal XML document.
  doc <- htmlTreeParse(url, useInternalNodes = TRUE)
  # Select the siteStats table and convert it to a data frame.
  nset <- getNodeSet(doc, "//div/table[@id='siteStats']")
  tables <- lapply(nset, readHTMLTable)
  # Take the second column and keep the text before the first line break.
  rankstr <- tables[[1]][2]
  rankstrdf <- strsplit(as.character(rankstr$V2), "\n")
  # Strip spaces and thousands separators, then convert to a number.
  rank <- gsub(" ", "", rankstrdf[[1]][1])
  rank <- as.numeric(gsub(",", "", rank))
  tmpdf <- data.frame(ORblog = urllist[i], AlexaRank = rank)
  ORrank <- rbind(ORrank, tmpdf)
  # Clear the temporaries before the next URL.
  rm(url, doc, nset, tables, rankstr, rankstrdf, rank, tmpdf)
}
rm(i)

# Sort by Alexa rank (ascending, NA last) and renumber the rows.
ORrank <- ORrank[order(ORrank$AlexaRank), ]
rownames(ORrank) <- 1:nrow(ORrank)
print(ORrank)
Here is the final output from the ORrank data frame.
ORblog AlexaRank
1 orforum.blog.informs.org 154736
2 scottaaronson.com 308410
3 bit-player.org 1444318
4 blog.athico.com 1484646
5 jtonedm.com 1504658
6 operationsroom.wordpress.com 1631529
7 geomblog.blogspot.com 1711672
8 www.hakank.org 1955830
9 www.scienceofbetter.org 2550459
10 engineered.typepad.com 2625563
11 stochastix.wordpress.com 3002085
12 punkrockor.wordpress.com 3303052
13 openresearch.wordpress.com 3811636
14 hsimonis.wordpress.com 4068033
15 fmwaves.kproductivity.com 4281627
16 annanagurney.blogspot.com 5047922
17 www.or-exchange.com 6052089
18 thinkor.org 6134442
19 analytics-magazine.com 6674061
20 healthyalgorithms.wordpress.com 7373428
21 john-raffensperger.blogspot.com 8516473
22 greenor.wordpress.com 8666209
23 orbythebeach.wordpress.com 9437585
24 arandomforest.com 12225347
25 mswd.wordpress.com 12571553
26 blog.intechne.com 13784064
27 spokutta.wordpress.com 15236071
28 cpstandard.wordpress.com 19401625
29 geneticargonaut.blogspot.com 20064295
30 ormsblog.com 21294575
31 wehart.blogspot.com 22329286
32 yetanothermathprogrammingconsultant.blogspot.com 24431355
33 dualnoise.blogspot.com 25165358
34 ateji.blogspot.com 25304653
35 reflectionsonor.wordpress.com 27537074
36 industrialengineertools.blogspot.com NA
37 john-poppelaars.blogspot.com NA
38 opsres.wordpress.com NA
39 bernoulli-on-business.blogspot.com NA
40 jimorlin.wordpress.com NA
41 optandor.com NA
42 restart2.blogspot.com NA
43 iaoreditor.blogspot.com NA
44 ornotes.blogspot.com NA
45 orinanobworld.blogspot.com NA
46 pomsblog.wordpress.com NA
47 research-reflections.blogspot.com NA
Not exactly the friendliest of formats, but it does the trick. I hope that this will help others who wish to use the powerful XML package with R. I know I have definitely learned a lot about XML in the process. I also found out that I have a lot more work to do with my blog.
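If a friendlier format is wanted, one option is to add thousands separators to the ranks before printing. This is only a sketch on three rows copied from the output above, and it is not part of the original script.

```r
# A few rows copied from the output above.
ORrank <- data.frame(
  ORblog = c("orforum.blog.informs.org", "scottaaronson.com", "bit-player.org"),
  AlexaRank = c(154736, 308410, 1444318),
  stringsAsFactors = FALSE
)

# formatC() with big.mark inserts thousands separators and returns text.
ORrank$AlexaRank <- formatC(ORrank$AlexaRank, format = "d", big.mark = ",")

# right = FALSE left-aligns the columns when printing.
print(ORrank, right = FALSE)
```

Because the formatted ranks are character strings, any sorting should be done on the numeric column before this step.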
Note: If you are wondering why Michael Trick’s blog is not in the list, there is a reason. Unfortunately his blog, like some others, sits in a sub-domain of another URL that is not affiliated with the blog. This means Alexa cannot rank it against blogs with a primary domain. Yet everyone in the Operations Research community knows where Michael’s blog ranks anyway.