Log File Analysis with R
[This article was first published on R-Chart, and kindly contributed to R-bloggers.]
R often comes up in discussions of heavy-duty scientific and statistical analysis (and so it should). However, it is also incredibly handy for a variety of more routine developer activities. And so I give you... log file analysis with R!
I was just involved in the launch of gradesquare.com (go ahead, click on the link and check it out. We will still be here later!). With the flurry of recent activity, I needed a way to visualize and communicate site activity to the rest of the team. It only takes a few lines of R to read in a log file (of a reasonable size), format the data, and generate some usable charts. Like most good ideas, it is not new. Most log files follow a similar format (such as the Common Log Format), so the following exercise should need only minor variations to work with other logs.
The only library that I used for this example was ggplot2 for charts.
library(ggplot2)
Read the Log File
A sample of the log file (miserably wrapped – my apologies):
66.12.71.25 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829
66.12.71.21 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288
66.12.71.25 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213
18.4.5.14 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370
If you can't make that out, just know that it is a relatively typical log file that includes the IP address of the client, two placeholder fields (shown as dashes), the date and time, the HTTP method and URL path, the HTTP response status code, a count of bytes returned, and the time required to process the request.
The log file can be read into a data frame as follows.
df = read.table('webapp.log')
There are a lot of different options available – and you might want to take advantage of these to minimize the amount of additional cleanup required after loading the file. For details:
help(read.table)
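As a rough sketch of what those options can do, the call below names the columns and keeps strings as plain character vectors at read time. The sample lines are the ones shown above, passed through the text argument so that the snippet runs without a file. The name log_df and the choice of options are illustrative assumptions, not the settings used for the charts in this post.

```r
# Sample lines in the format shown above (normally these come from the file).
log_lines <- c(
  '66.12.71.25 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '66.12.71.21 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288',
  '66.12.71.25 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213',
  '18.4.5.14 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370'
)

# Fields are split on whitespace; the double-quoted request stays in one column.
log_df <- read.table(
  text = log_lines,
  col.names = c("host", "ident", "authuser", "date", "time",
                "request", "status", "bytes", "duration"),
  stringsAsFactors = FALSE
)

# Inspect the column types that read.table inferred.
str(log_df)
```

To read from disk instead, drop the text argument and pass the file path as the first argument.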
Clean Up and Format
I chose to clean up manually after the fact. To start, we name the columns in the data frame.
colnames(df)=c('host','ident','authuser','date','time','request','status','bytes','duration')
The date and time were split up when read in above. I am not concerned with the time at this point but do want the date to be cast to a date type.
df$date=as.Date(df$date,"[%d/%b/%Y")
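If the time of day matters later, the two halves can be glued back together and parsed as a single timestamp. This is a minimal sketch using made-up values in the same bracketed shape as the date and time columns. The UTC time zone is an assumption, since the log does not record one. Note that %b matches month abbreviations in the current locale, so "Feb" may fail to parse under a non-English locale.

```r
# Hypothetical values shaped like the date and time columns after read.table.
date_part <- c("[21/Feb/2012", "[21/Feb/2012")
time_part <- c("23:44:11]", "23:45:59]")

# Date only: the leading bracket is matched literally by the format string.
day <- as.Date(date_part, format = "[%d/%b/%Y")

# Full timestamp: paste the halves back together and parse both brackets.
stamp <- as.POSIXct(paste(date_part, time_part),
                    format = "[%d/%b/%Y %H:%M:%S]", tz = "UTC")

# Seconds elapsed between the two requests.
elapsed <- difftime(stamp[2], stamp[1], units = "secs")
print(elapsed)
```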
To see the column names and first few rows of our data frame…
head(df)
There are a number of different ways of getting a quick handle on the data; you could do a summary, for instance. One item that you might want to have is the number of requests per HTTP status.
table(df$status)
But the item of immediate interest is simply the number of requests. The following will provide the number of requests by date.
reqs=as.data.frame(table(df$date))
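A small, made-up example shows what that data frame looks like. Because table() is called on an expression rather than a bare variable name, the resulting columns get the default names Var1 and Freq, and Var1 is a factor, which is why the line chart in the next section converts it back with as.Date. The demo data frame and its dates are invented for illustration.

```r
# Invented request dates: one on the 20th, two on the 21st, one on the 22nd.
demo <- data.frame(
  date = as.Date(c("2012-02-20", "2012-02-21", "2012-02-21", "2012-02-22"))
)

# Count rows per date and turn the table into a data frame.
reqs_demo <- as.data.frame(table(demo$date))

names(reqs_demo)       # default column names from as.data.frame
class(reqs_demo$Var1)  # the dates come back as a factor

# Convert the factor labels back to real dates for plotting or sorting.
reqs_demo$day <- as.Date(as.character(reqs_demo$Var1))
print(reqs_demo)
```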
R is really great for these quick summarizations, and if you memorize a few functions you will be able to address most needs easily. At a certain point, I can better visualize data problems using SQL, and so use the sqldf library. For now – on to some charts using ggplot2.
Make Some Charts
# ggtitle() sets the title; the older opts(title=...) has been removed from ggplot2
ggplot(data=reqs, aes(x=as.Date(Var1), y=Freq)) + geom_line() + xlab('Date') + ylab('Requests') + ggtitle('Traffic to Site')
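The first chart casts Var1 back to a date with as.Date so that the x-axis is a continuous date scale. The second chart, on the other hand, uses the format function to turn the (HTTP) status value into a string so that it is treated as discrete.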
ggplot(data=df, aes(x=format(status))) + geom_bar() + xlab('Status') + ylab('Count') + ggtitle('Status')
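For readers on a current ggplot2 release, the same bar chart can be sketched with factor() in place of format(), which is the more common way to mark a numeric column as discrete, and with ggtitle() for the title, since opts() is no longer available. The status values below are invented sample data, and status_demo is a hypothetical name.

```r
library(ggplot2)

# Invented status codes standing in for the status column of the log.
status_demo <- data.frame(status = c(200, 200, 200, 302, 404, 200, 500, 404))

# factor() makes the numeric codes discrete, so each code gets its own bar.
p <- ggplot(data = status_demo, aes(x = factor(status))) +
  geom_bar() +
  xlab("Status") +
  ylab("Count") +
  ggtitle("Status")

print(p)
```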
By the way, the images were exported as pngs for the blog by assigning the chart to a variable p and printing like so:
png("imagename.png")
print(p)
dev.off()
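Put together, an export might look like the sketch below. The chart data, the temporary output path, and the pixel dimensions are all assumptions for illustration. ggplot2 also provides ggsave() as a shortcut for the same job.

```r
library(ggplot2)

# Invented daily request counts, just to have something to draw.
demo <- data.frame(
  day = as.Date("2012-02-20") + 0:4,
  requests = c(12, 30, 45, 28, 51)
)
p <- ggplot(demo, aes(x = day, y = requests)) + geom_line()

# Write to a temporary file so the example does not clutter the working directory.
out_file <- file.path(tempdir(), "traffic.png")

png(out_file, width = 800, height = 600)  # open the PNG device (size in pixels)
print(p)                                  # draw the chart onto the device
dev.off()                                 # close the device and flush the file

file.exists(out_file)
```

The explicit print(p) matters in scripts and functions, where a ggplot object is not drawn automatically the way it is at the interactive prompt.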
So there you have it: functional, useful R that addresses a practical everyday need of web developers. It is also a great way to introduce yourself to R, a simple, relevant exercise that provides immediate value.
The next time Google Analytics falls short, pull out R and give it a try!