
Log File Analysis with R

[This article was first published on R-Chart, and kindly contributed to R-bloggers].

 

R often comes up in discussions of heavy-duty scientific and statistical analysis (and so it should). However, it is also incredibly handy for a variety of more routine developer activities. And so I give you… log file analysis with R!

I was just involved in the launch of gradesquare.com (go ahead – click on the link and check it out. We will still be here later!). With the flurry of recent activity, I needed a way to visualize and communicate site activity to the rest of the team. It only takes a few lines of R to read in a log file (of a reasonable size), format the data, and generate some usable charts. Like most good ideas – it is not new. Most log files follow a similar format (such as the Common Log Format), so your own logs may require only minor variations to the following exercise.

The only library that I used for this example was ggplot2 for charts.  
library(ggplot2)

Read the Log File
A sample of the log file (miserably wrapped – my apologies):


66.12.71.25 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829
66.12.71.21 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288
66.12.71.25 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213
18.4.5.14 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370



If you can’t make that out – just know that it is a relatively typical log file that includes the IP address of the client, the date and time of the request, the HTTP method and URL path, the HTTP response status code, a count of bytes returned and the time required to process the request.



The log file can be read into a data frame as follows.

df = read.table('webapp.log')

There are a lot of different options available – and you might want to take advantage of these to minimize the amount of additional cleanup required after loading the file.  For details:

help(read.table)
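
As a rough sketch of how those options can cut down on later cleanup, the call below names the columns and sets their types while reading. It reads the four sample lines from a character vector rather than from webapp.log so that it runs on its own. The names log_lines and log_df are illustrative, and the integer type for the bytes column is an assumption that holds for this sample but may not hold for every log.

```r
# Sample lines in the same layout as the excerpt above (illustrative data).
log_lines <- c(
  '66.12.71.25 - - [21/Feb/2012 23:44:11] "GET /course/1894/detail HTTP/1.1" 200 7017 5.0829',
  '66.12.71.21 - - [21/Feb/2012 23:44:39] "GET /search_by_author?search_learn_exp=Khan+Academy&page=193 HTTP/1.1" 200 8019 0.3288',
  '66.12.71.25 - - [21/Feb/2012 23:45:21] "GET /course/19/detail HTTP/1.1" 200 6851 0.1213',
  '18.4.5.14 - - [21/Feb/2012 23:45:59] "GET /search_by_subject?search_learn_exp=algebra-i-worked-examples HTTP/1.1" 200 7939 0.0370'
)

log_df <- read.table(
  text = log_lines,   # for a real file, use file = "webapp.log" instead
  col.names = c("host", "ident", "authuser", "date", "time",
                "request", "status", "bytes", "duration"),
  # Assumption: status and bytes are always whole numbers in this log.
  colClasses = c(rep("character", 6), "integer", "integer", "numeric"),
  quote = "\"",       # only double quotes delimit the request field
  comment.char = ""   # do not treat '#' in a URL as the start of a comment
)

str(log_df)
```

Setting the column names here would make a separate colnames() step unnecessary, at the cost of a longer read.table call.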






Clean Up and Format 
I chose to clean up manually after the fact.  To start, we name the columns in the data frame.


colnames(df)=c('host','ident','authuser','date','time','request','status','bytes','duration')


The date and time were split up when read in above.  I am not concerned with the time at this point but do want the date to be cast to a date type.

df$date=as.Date(df$date,"[%d/%b/%Y")
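
To make the format string concrete, here is a small sketch on two hand-typed values in the same shape as the date and time columns. The leading bracket is matched literally, and %b expects an abbreviated month name in the current locale (English here). The second half goes one step beyond what this post needs: it is one possible way to recover a full timestamp if the time of day matters later. The variable names and the UTC time zone are assumptions, since the log does not state a zone.

```r
# Hand-typed values shaped like the 'date' and 'time' columns (illustrative).
date_part <- c("[21/Feb/2012", "[21/Feb/2012")
time_part <- c("23:44:11]", "23:45:59]")

# Same conversion as above: the "[" in the format matches the literal bracket.
day <- as.Date(date_part, format = "[%d/%b/%Y")
print(day)

# Optional: rebuild a full timestamp. tz = "UTC" is an assumption.
stamp <- as.POSIXct(paste(date_part, time_part),
                    format = "[%d/%b/%Y %H:%M:%S]", tz = "UTC")

# Requests per hour of day.
hour <- format(stamp, "%H")
print(table(hour))
```

If as.Date returns NA for every row, a non-English locale is one likely cause, because %b then expects month abbreviations in that language.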


To see the column names and first few rows of our data frame…
head(df)

There are a number of different ways of getting a quick handle on the data – you could do a summary, for instance. One item that you might want to have is the number of requests per HTTP status.

table(df$status)
 

But the item of immediate interest is simply the number of requests.  The following will provide the number of requests by date.
reqs=as.data.frame(table(df$date))
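
For a sense of what these two summaries return, here is a minimal sketch on a tiny made-up data frame (demo and its values are invented for illustration and are not taken from the real log). It shows the Var1 and Freq column names that as.data.frame gives to a one-way table, plus the share of requests per status.

```r
# Made-up stand-in for the cleaned log data frame.
demo <- data.frame(
  date   = as.Date(c("2012-02-20", "2012-02-21", "2012-02-21", "2012-02-21")),
  status = c(200L, 200L, 404L, 200L)
)

# Requests per date: one row per distinct date, columns Var1 and Freq.
reqs_demo <- as.data.frame(table(demo$date))
print(reqs_demo)

# Requests per status, as counts and as proportions of all requests.
status_counts <- table(demo$status)
print(status_counts)
print(round(prop.table(status_counts), 2))
```

Note that Var1 comes back as a factor rather than a Date, which matters once the data is plotted.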

R is really great for these quick summarizations, and if you memorize a few functions you will be able to address most needs easily.  At a certain point, I can better visualize data problems using SQL, and so use the sqldf library.  For now – on to some charts using ggplot2.

Make Some Charts


One “gotcha” that I hit fairly often with R and ggplot2 is the need to cast variables in a way that allows them to be treated as either continuous or discrete. In the following, casting the Var1 field as a Date allows it to be treated as continuous, and geom_line() renders a line as intended.

ggplot(data=reqs, aes(x=as.Date(Var1), y=Freq)) + geom_line() + xlab('Date') + ylab('Requests') + ggtitle('Traffic to Site')




On the other hand, the format function is used in this example to cause the (http) status value to be treated as discrete.

ggplot(data=df, aes(x=format(status))) + geom_bar() + xlab('Status') + ylab('Count') + ggtitle('Status')
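
The two casts above can be checked without drawing anything. The sketch below uses made-up values (reqs_demo and status_demo are illustrative names) to show which class each expression produces, since that class is what decides whether ggplot2 treats the variable as continuous or discrete. The factor() line is an alternative I am suggesting, not what the charts above use.

```r
# Made-up counts in the shape produced by as.data.frame(table(...)).
reqs_demo <- data.frame(
  Var1 = factor(c("2012-02-20", "2012-02-21")),
  Freq = c(1L, 3L)
)

class(reqs_demo$Var1)            # factor: treated as discrete
class(as.Date(reqs_demo$Var1))   # Date: treated as continuous

# Made-up status codes stored as integers.
status_demo <- c(200L, 404L, 200L)

class(status_demo)               # integer: treated as continuous
class(format(status_demo))       # character: treated as discrete
class(factor(status_demo))       # factor: also discrete

# format() pads values to a common width, factor() does not.
format(c(99L, 200L))
levels(factor(c(99L, 200L)))
```

The padding is harmless for three-digit status codes, but it is one reason to prefer factor() when the values differ in width.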


By the way, the images were exported as pngs for the blog by assigning the chart to a variable p and printing like so:



png("imagename.png")
print(p)
dev.off()

So there you have it – functional, useful R that addresses a practical everyday need of web developers. It is also a simple, relevant exercise that can introduce you to R while providing immediate value.

The next time Google Analytics falls short, pull out R and give it a try!


