The 1000 most-visited sites analyzed using R
Ever wondered which Computers and Electronics web sites get the most page views? Based upon data recently published by Google:

[Figure: bar chart of the top Computers & Electronics sites by page views]

The R program to create this graph is as follows:
library(XML)

# URL for the Google data
u = "http://www.google.com/adplanner/static/top1000/"

# Read every HTML table on the page; the site listing is the second one
tables = readHTMLTable(u)
l = tables[[2]]

# Name the columns
colnames(l) = c('Rank', 'Site', 'Category', 'Users', 'Reach', 'Views', 'Advertising?')

# Extract the Computers & Electronics subset
CandE = l[l$Category == 'Computers & Electronics', c('Site', 'Views')]
rownames(CandE) = CandE[, 1]
CandE = CandE[-1]

# Strip the thousands separators and convert to numeric
CandE$Views = as.numeric(gsub(',', '', CandE$Views))

# Sort by page views; drop = FALSE keeps the data frame (and its row names) intact
CandE = CandE[order(CandE$Views, decreasing = TRUE), , drop = FALSE]

par(las = 2, mar = c(12, 10, 1, 2) + 0.1)
barplot(t(as.matrix(CandE)), yaxt = 'n', ylab = '', main = 'Top Computers & Electronics Sites', col = 'orange')

# Label the y-axis ticks with commas rather than scientific notation
axis(side = 2, at = axTicks(2), labels = format(axTicks(2), big.mark = ',', scientific = FALSE))
Discussion and Interactive Commands…
Google recently posted the 1000 most-visited sites on the web. The data is displayed in tabular format, but is a bit unwieldy to view interactively. Sounds like a candidate for some analysis using R and the XML package! I was very impressed by how easy it was to scrape the relevant data and produce meaningful summarizations with only a few lines of code.
library(XML)
u = "http://www.google.com/adplanner/static/top1000/"
tables = readHTMLTable(u)
l = tables[[2]]
colnames(l) = c('Rank', 'Site', 'Category', 'Users', 'Reach', 'Views', 'Advertising?')
These few lines of code read in the data available at the site and assign column names. To see a sample of the data we now have available:
head(l)
Both numerical and categorical data are available in the table. For example, each site is categorized by whether or not it includes advertising.

summary(l$`Advertising?`)
68% of the sites listed do advertise:
ad = summary(l$`Advertising?`)
# Percentage of listed sites that advertise
(ad['Yes'] / sum(ad)) * 100
Plotting subsets of the data is probably the best way to go – but to start, I wanted to see which categories of sites appear most frequently on the list.
par(las=2, mar=c(4, 12, 1, 2) + 0.1)
barplot(sort(summary(l$Category), decreasing=FALSE), horiz=TRUE, cex.names=0.6)
I realize that this is impossible to read here – you will need to run it yourself. There are 216 unique categories.
length(unique(l$Category))
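One way to make the chart legible – a sketch of mine, not from the original post – is to plot only the most frequent categories:

# Horizontal bar chart of just the 20 most common categories
top20 = head(sort(summary(l$Category), decreasing = TRUE), 20)
par(las = 2, mar = c(4, 12, 1, 2) + 0.1)
barplot(sort(top20, decreasing = FALSE), horiz = TRUE, cex.names = 0.8)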
The Top 10 (the one-liner after this list shows how to compute them) are:
- Other
- Web Portals
- Social Networks
- Online Games
- File Sharing & Hosting
- Newspapers
- Blogging Resources & Services
- News & Current Events
- Computers & Electronics
- Search Engines
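The list above can be reproduced with one line, assuming Category was read in as a factor (the XML package's default here):

# Names of the ten most frequent categories
names(head(sort(summary(l$Category), decreasing = TRUE), 10))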
I’ll leave subsequent analysis to you. To limit yourself to specific columns:
l[, c('Site', 'Users', 'Views', 'Reach')]
You will probably want to begin munging the numeric data, and will need to interpret the string values as numbers:

# Remove thousands separators and percent signs, then convert to numeric
l$Views = as.numeric(gsub(',', '', l$Views))
l$Reach = as.numeric(gsub('%', '', l$Reach))
l$Users = as.numeric(gsub(',', '', l$Users))
Having done so, you will be able to plot them…
plot(l$Reach)
plot(l$Users)
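For a somewhat richer view, here is a minimal sketch (assuming the conversions above succeeded) plotting unique users against page views on log scales:

# Each point is one site; the log-log scale tames the skew in both variables
plot(l$Users, l$Views, log = 'xy', xlab = 'Unique Users', ylab = 'Page Views', main = 'Top 1000 Sites')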
It might be more interesting to focus on sites in a particular category. For example, if your niche is Cooking and Recipes:
l[l$Category == 'Cooking & Recipes', c('Site', 'Users', 'Views', 'Reach')]
And for sites dedicated to the Java programming language:
l[l$Category == 'Java', c('Site', 'Users', 'Views', 'Reach')]
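The bar chart idiom from the top of the post generalizes to any category. A sketch – the helper name plot_category is mine, not from the original post:

# Hypothetical helper: bar chart of page views for every site in one category
plot_category = function(df, category) {
  sub = df[df$Category == category, c('Site', 'Views')]
  views = sub$Views                    # already numeric after the munging above
  names(views) = as.character(sub$Site)
  par(las = 2, mar = c(12, 10, 1, 2) + 0.1)
  barplot(sort(views, decreasing = TRUE), main = category, col = 'orange')
}
plot_category(l, 'Cooking & Recipes')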
I'd love to see your ideas for analyzing this data in the comments – it's a great opportunity to show off your analytical and R skills!
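To prime the pump, here is one hedged idea (again assuming the numeric conversions above): total page views aggregated by category.

# Sum page views within each category and show the ten largest
views_by_cat = tapply(l$Views, l$Category, sum)
head(sort(views_by_cat, decreasing = TRUE), 10)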