
News from UseR!2015 – the RHadoop tutorial

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Andrie de Vries

Today is the first day of the UseR!2015 conference in Aalborg in northern Denmark.  But yesterday was a day packed with 16 tutorials on a range of interesting topics.  I submitted a proposal many months ago to run a session on using R with Hadoop, and was very happy to be selected for a slot in the morning.

When we first started planning the session, we set a Big Hairy Audacious Goal to run the session using a HortonWorks Hadoop cluster hosted in the Microsoft Azure cloud.

We trialled the session at the Birmingham R user group in May, and then again last week during an internal Microsoft webinar. In both cases the cluster performed well.

Then I asked the organisers how many people had expressed an interest: 66 people said they might come, and the room would seat 144.

At this point I started having cold sweats!

As luck would have it, minutes before start time something went wrong: the cluster would not start, and participants were unable to access the server for the first hour of the tutorial. Fortunately, we had a back-up plan.

You can take a vicarious tour through the analysis of the NYC Taxi Cab data with the rmarkdown slides built with the RStudio presentation suite.

Although I did not cover all the material during the session, the full set of presentations is:

1. Using R with Hadoop

2. Analysing New York taxi data with Hadoop

3. Computing on distributed matrices

4. Using RHive to connect to the hive database

(These presentations, sample data, and all the scripts and examples live on GitHub at https://github.com/andrie/RHadoop-tutorial. Use the tag https://github.com/andrie/RHadoop-tutorial/tree/2015-06-30-UseR!2015 to find the state of the repository at the time of the tutorial.)

The central example throughout is using mapreduce() from the rmr2 package to summarize New York taxi journeys, grouped by hour of the day. The dataset is ~200GB of uncompressed CSV files.

You can read more about how this data came into the public domain at http://chriswhong.com/open-data/foil_nyc_taxi/. Here is a simple plot of the results, computed from a 1-in-1,000 sample of the data:

[Plot: taxi trips in New York by hour of day]

The full code is available in a GitHub repository at https://github.com/andrie/RHadoop-tutorial. Here is an extract:

library(rmr2)
library(rhdfs)
hdfs.init()

rmr.options(backend = "hadoop")

# List the taxi files in HDFS and construct the input path
hdfs.ls("taxi")$file
homeFolder <- file.path("/user", Sys.getenv("USER"))
taxi.hdp <- file.path(homeFolder, "taxi")

# Column names and classes come from a separate data dictionary file
headerInfo <- read.csv("data/dictionary_trip_data.csv", stringsAsFactors = FALSE)
colClasses <- as.character(as.vector(headerInfo[1, ]))

taxi.format <- make.input.format(
  format = "csv", sep = ",",
  col.names = names(headerInfo),
  colClasses = colClasses,
  stringsAsFactors = FALSE
)

# Map: for each chunk, count trips by hour of the pickup datetime (column 6)
taxi.map <- function(k, v){
  original <- v[[6]]
  date  <- as.Date(original, origin = "1970-01-01")
  wkday <- weekdays(date)
  hour  <- format(as.POSIXct(original), "%H")
  dat <- data.frame(date, hour)
  z <- aggregate(date ~ hour, dat, FUN = length)
  keyval(z[[1]], z[[2]])
}

# Reduce: sum the per-chunk counts for each hour key
taxi.reduce <- function(k, v){
  data.frame(hour = k, trips = sum(v), row.names = k)
}

m <- mapreduce(taxi.hdp, input.format = taxi.format,
               map = taxi.map,
               reduce = taxi.reduce
)

dat <- values(from.dfs(m))

# Plot trips by hour, with a loess smoother over the hourly counts
library("ggplot2")
p <- ggplot(dat, aes(x = hour, y = trips, group = 1)) +
  geom_smooth(method = "loess", span = 0.5,
              col = "grey50", fill = "yellow") +
  geom_line(col = "blue") +
  expand_limits(y = 0) +
  ggtitle("Sample of taxi trips in New York")

p
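If you want to check the map/reduce logic without a Hadoop cluster, the same two steps can be simulated locally in base R. The sketch below uses synthetic timestamps (it is not part of the original tutorial): the "map" step counts trips per hour within a chunk, and the "reduce" step sums the partial counts per hour key.

```r
# Local sanity check of the map/reduce logic on synthetic data
set.seed(42)
trips <- data.frame(
  pickup_datetime = as.POSIXct("2013-01-01 00:00:00", tz = "UTC") +
    runif(1000, 0, 24 * 3600)   # 1000 random pickup times in one day
)

# "Map": count trips by hour within this chunk
hour <- format(trips$pickup_datetime, "%H")
partial <- aggregate(list(trips = hour), by = list(hour = hour), FUN = length)

# "Reduce": sum partial counts per hour key
# (trivial here with one chunk, but this is what taxi.reduce does across chunks)
result <- aggregate(trips ~ hour, data = partial, FUN = sum)

sum(result$trips)  # every trip counted exactly once: 1000
```

With many chunks, each mapper emits one partial count per hour it sees, and the reducer receives all partial counts for a given hour and adds them, which is why summing in the reducer is correct.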
