Preparing Big Data for Analysis in R

Joseph Rickert

8 years ago

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Yaniv Mor, Co-founder & CEO of Xplenty

How do you get Big Data ready for R? Gigabytes or terabytes of raw data may need to be combined, cleaned, and aggregated before they can be analyzed. Processing such large amounts of data used to require installing Hadoop on a cluster of servers, not to mention coding MapReduce jobs in Pig or Java. Those days are over.

This post shows how raw data can be prepared for analysis in R without any code or server installations. Instead, we’ll use Xplenty’s data integration-as-a-service to design a data flow, create a cluster, and run the job, all via a friendly user interface.

For this demo we’ll use 1.5 GB of raw web logs (uncompressed) from the servers that hosted the “Star Wars Kid” video. The same servers also hosted a remix of the video, along with the usual assortment of HTML pages, images, and more. Here’s an example log line:

208.63.63.94 - - [11/Apr/2003:12:36:39 -0700] "GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 28361 "http://www.kottke.org/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"

Log line format:

Source IP/domain
User Identifier (blank)
UserID (blank)
Date – in the format of dd/MMM/yyyy:HH:mm:ss Z
HTTP request – type, URL, HTTP version
HTTP code
Bytes transferred
Referrer
User agent
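
To make the format concrete, here is a minimal R sketch that splits one such line into the fields listed above. It is only an illustration, not part of the Xplenty dataflow: the pattern `log_pattern`, the helper `parse_log_line`, and the field names are all hypothetical, and the sample line is written with plain hyphens and a fully quoted user agent, which is an assumption about the raw file.

```r
# One sample line in the format described above (plain hyphens for the blank fields).
log_line <- paste0(
  '208.63.63.94 - - [11/Apr/2003:12:36:39 -0700] ',
  '"GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1" 200 28361 ',
  '"http://www.kottke.org/" ',
  '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)"'
)

# Hypothetical pattern: one capture group per field in the list above.
log_pattern <- paste0(
  '^(\\S+) (\\S+) (\\S+) ',      # source IP/domain, user identifier, user ID
  '\\[([^]]+)\\] ',              # date, inside square brackets
  '"(\\S+) (\\S+) ([^"]*)" ',    # request: type, URL, HTTP version
  '(\\d{3}) (\\S+) ',            # HTTP code, bytes transferred
  '"([^"]*)" "([^"]*)"$'         # referrer, user agent
)

# Hypothetical helper: returns a named list of fields, or NULL if the line does not match.
parse_log_line <- function(line) {
  m <- regmatches(line, regexec(log_pattern, line, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)
  fields <- as.list(m[-1])  # drop the full match, keep the capture groups
  names(fields) <- c('ip', 'ident', 'user', 'datetime', 'method', 'url',
                     'http_version', 'status', 'bytes', 'referrer', 'user_agent')
  fields
}

parsed <- parse_log_line(log_line)

# The date field keeps its time and UTC offset, so parse it as a timestamp.
# %b assumes English month abbreviations in the current locale.
parsed$timestamp <- as.POSIXct(parsed$datetime,
                               format = '%d/%b/%Y:%H:%M:%S %z', tz = 'UTC')

str(parsed[c('ip', 'url', 'status', 'referrer')])
```

Lines that do not fit the pattern come back as NULL, which makes malformed records easy to count or skip before any analysis.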

Let’s say we would only like to analyze requests to the original “Star Wars Kid” video by source IP, date, and referrer. Imagine what it would be like to set up the servers and write the code: hours spent writing and debugging a relatively simple dataflow. Feel the stress building? Let it go. Here’s what such a dataflow looks like in Xplenty:

Let’s take a closer look at how it works:

Source – loads the data from Amazon S3 and splits it into fields. The data is publicly available on S3 at xplenty.public/weblogs/star_wars_kid.log.gz. If you’d like to take a look at the data, download it via the web, or create an AWS account and use a tool such as S3Browser to access the above path.

Select – only keeps the ip, date, url, and referrer fields while leaving the rest of the data out. Note that the date also contains the time, and that the request also contains the request type and HTTP version. They are both cleaned in the select component using a regular expression.

Filter – matches Star_Wars_Kid.wmv in the URL field and removes any other log lines.

Destination – stores the results back into Amazon S3.
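
For readers who think in R, the Select and Filter steps can be sketched locally on a tiny in-memory sample. This is only an analogy under stated assumptions, not how Xplenty implements its components: the `logs` data frame, its column names, and the times and status codes in it are made up for illustration, and the regular expressions are one plausible way to do the cleaning.

```r
# Hypothetical, already-split log records (times and status codes are illustrative only).
logs <- data.frame(
  ip       = c('66.142.89.235', '63.195.36.218', '208.63.63.94'),
  datetime = c('09/May/2003:10:15:02 -0700', '09/May/2003:10:15:07 -0700',
               '11/Apr/2003:12:36:39 -0700'),
  request  = c('GET /random/video/Star_Wars_Kid.wmv HTTP/1.1',
               'GET /random/video/Star_Wars_Kid.wmv HTTP/1.0',
               'GET /archive/2003/04/03/typo_pop.shtml HTTP/1.1'),
  status   = c(200L, 206L, 200L),
  referrer = c('http://www.waxy.org/', '-', 'http://www.kottke.org/'),
  stringsAsFactors = FALSE
)

# Select: keep ip, date, url, and referrer, cleaning date and request with regular expressions.
selected <- data.frame(
  ip       = logs$ip,
  date     = sub(':.*$', '', logs$datetime),                  # drop the time and offset
  url      = sub('^\\S+ (\\S+) \\S+$', '\\1', logs$request),  # keep only the URL
  referrer = logs$referrer,
  stringsAsFactors = FALSE
)

# Filter: keep only requests whose URL contains Star_Wars_Kid.wmv.
videos <- selected[grepl('Star_Wars_Kid.wmv', selected$url, fixed = TRUE), ]

print(videos)
```

This works for a handful of rows in memory; the point of the hosted dataflow is to apply the same two steps to gigabytes of logs without loading them into a single R session.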

No setup or installation is needed. A few clicks are enough to create a new cluster, and one more screen gets the job running.

The results: about 120 MB (uncompressed) of log lines for video file requests, with IPs, dates, URLs, and referrers, now ready for analysis. The job ran for about 3 minutes. The full results are available in the xplenty.dumpster bucket at starwarskid/videos.gz. Here are a few sample lines:

66.142.89.235   09/May/2003     /random/video/Star_Wars_Kid.wmv     http://www.waxy.org/
63.195.36.218   09/May/2003     /random/video/Star_Wars_Kid.wmv     -
66.27.235.199   09/May/2003     /random/video/Star_Wars_Kid.wmv     http://www.kuro5hin.org/story/2003/5/2/16116/46048
24.81.67.79     09/May/2003     /random/video/Star_Wars_Kid.wmv     http://www.waxy.org/archive/2003/04/29/star_war.shtml
12.149.141.14   09/May/2003     /random/video/Star_Wars_Kid.wmv     http://www.waxy.org/

Now we can finally analyze the data in R. Here’s sample code that generates a graph of daily traffic to Star_Wars_Kid.wmv, assuming the results have been downloaded and saved locally as star-wars-kid.tsv:

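library(ggplot2)  # provides ggplot(), geom_line(), and ggtitle()
# Read whitespace-separated fields; disable quote and comment handling so URLs containing ' or # stay intact
df <- read.table('star-wars-kid.tsv', fill = TRUE, quote = '', comment.char = '', stringsAsFactors = FALSE)
colnames(df) <- c('ip', 'date', 'url', 'referrer')
# %b expects English month abbreviations, so this assumes an English locale
df$date <- as.Date(df$date, format = '%d/%b/%Y')
# Count requests per day; table() returns the dates as labels in Var1
reqs <- as.data.frame(table(df$date))
ggplot(data = reqs, aes(x = as.Date(as.character(Var1)), y = Freq)) + geom_line() +
  xlab('Date') + ylab('Requests') + ggtitle('Traffic to Star Wars Kid Video')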
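
Since the goal was to analyze requests by source IP, date, and referrer, the same kind of data frame can also be summarized by referrer. Here is a minimal sketch that uses only the five sample lines shown earlier, so the counts say nothing about the full data set. Treating "-" as a missing referrer is an assumption, and the `sample_lines` and `ref_counts` names are illustrative.

```r
# The five sample result lines from above, with single spaces between fields.
sample_lines <- c(
  '66.142.89.235 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/',
  '63.195.36.218 09/May/2003 /random/video/Star_Wars_Kid.wmv -',
  '66.27.235.199 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.kuro5hin.org/story/2003/5/2/16116/46048',
  '24.81.67.79 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/archive/2003/04/29/star_war.shtml',
  '12.149.141.14 09/May/2003 /random/video/Star_Wars_Kid.wmv http://www.waxy.org/'
)

# Read the lines from memory instead of from a file.
df <- read.table(text = sample_lines, fill = TRUE, quote = '', comment.char = '',
                 col.names = c('ip', 'date', 'url', 'referrer'),
                 stringsAsFactors = FALSE)

# Assumption: "-" means the request carried no referrer, so treat it as missing.
df$referrer[df$referrer == '-'] <- NA

# Count requests per referrer (table() drops NA by default), most frequent first.
ref_counts <- sort(table(df$referrer), decreasing = TRUE)

print(head(ref_counts, 3))
```

On the full results, the same two lines at the end would rank every referring page by how much traffic it sent to the video.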

Additional components could easily be added to the dataflow for joining several sources, sorting data, extracting strings with regular expressions, and more. The same dataflow could be used to process even 1.5 TB of data, or a directory that contains many big files. Would you like to prepare your data for analysis in R? Get a free Xplenty account and start crunching your data.
