Exploring SparkR

Alvaro "Blag" Tejada Galindo

7 years ago

[This article was first published on Blag's bag of rants, and kindly contributed to R-bloggers].


A colleague from work asked me to investigate Spark and R. So the most obvious thing to do was to investigate SparkR -;)

I installed Scala, Hadoop, Spark and SparkR…not sure Hadoop is needed for this…but I wanted to have the full picture -:)

Anyway…I came across a piece of code that reads lines from a file and counts how many lines have an “a” and how many lines have a “b”…

For this code I used the lyrics of Girls Not Grey by AFI…

SparkR.R

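library(SparkR)

# Timing starts before the Spark context is created, so startup cost is included
start.time <- Sys.time()
sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
# textFile() and filterRDD() are internal (non-exported) SparkR functions, hence the ":::"
logData <- SparkR:::textFile(sc, logFile)
# Count the lines containing "a" and the lines containing "b"
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken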

0.3167355 seconds…pretty fast…I wonder how regular R will behave?

PlainR.R

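library("stringr")  # loaded in the original script, although nothing below uses it

start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Girls_Not_Grey"
# read.table() splits each line into words; fill = TRUE pads the shorter lines
logfile <- read.table(logFile, header = FALSE, fill = TRUE)
# Paste the words of each row back together into a single line of text
logfile <- apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df <- data.frame(lines=logfile)
# Count the lines containing "a" and the lines containing "b", one row at a time
a <- sum(apply(df, 1, function(x) grepl("a", x)))
b <- sum(apply(df, 1, function(x) grepl("b", x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken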

Nice…0.01522398 seconds…wait…what? Isn’t Spark supposed to be pretty fast? Well…I remembered that I read somewhere that Spark shines with big files…

Well…I prepared a file with 5 columns and 1 million records…let’s see how that goes…
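The contents of Doc_Header.csv are not shown here, so the following is only a sketch of how a comparable test file could be generated. The helper name `make_test_csv`, the five column names, and the random values are all invented for illustration; the only things taken from the text are the 5 columns, the 1 million records, and the absence of a header row.

```r
# Hypothetical helper: the real Doc_Header.csv is not shown in the post,
# so every column name and value here is made up for illustration.
make_test_csv <- function(path, n = 1e6, seed = 42) {
  set.seed(seed)
  # Build n random 5-letter lowercase words without looping over rows
  random_words <- function(k) {
    do.call(paste0, replicate(5, sample(letters, k, replace = TRUE),
                              simplify = FALSE))
  }
  df <- data.frame(
    id     = seq_len(n),
    code   = random_words(n),
    name   = random_words(n),
    amount = round(runif(n, 0, 1000), 2),
    flag   = sample(c("X", "Y"), n, replace = TRUE),
    stringsAsFactors = FALSE
  )
  # No header row, matching the header = F used when the file is read back
  write.table(df, path, sep = ",", row.names = FALSE,
              col.names = FALSE, quote = FALSE)
  invisible(path)
}

# Small run so the sketch finishes quickly; use n = 1e6 for the full size
path <- tempfile(fileext = ".csv")
make_test_csv(path, n = 1000)
length(readLines(path))
```

With random letters, nearly every line will contain both an "a" and a "b", so the counts themselves are not interesting; the file is only useful for timing.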

SparkR.R

library(SparkR)

start.time <- Sys.time()
sc <- sparkR.init(master="local")
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logData <- SparkR:::textFile(sc, logFile)
numAs <- count(SparkR:::filterRDD(logData, function(s) { grepl("a", s) }))
numBs <- count(SparkR:::filterRDD(logData, function(s) { grepl("b", s) }))
paste("Lines with a: ", numAs, ", Lines with b: ", numBs, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

26.45734 seconds for a million records? Nice job -:) Let’s see if plain R wins again…

PlainR.R

library("stringr")

start.time <- Sys.time()
logFile <- "/home/blag/R_Codes/Doc_Header.csv"
logfile<-read.csv(logFile,header = F)
logfile<-apply(logfile[,], 1, function(x) paste(x, collapse=" "))
df<-data.frame(lines=logfile)
a<-sum(apply(df,1,function(x) grepl("a",x)))
b<-sum(apply(df,1,function(x) grepl("b",x)))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

48.31641 seconds? Looks like Spark was almost twice as fast this time…and this is a pretty simple example…I’m sure that when complexity arises…the gap is even bigger…

And sure…I know that a lot of people can take my plain R code and make it even faster than Spark…but…this is my blog…not theirs -;)

I will come back as soon as I learn more about SparkR -:D

UPDATE

So…I got a couple of comments claiming that read.csv() is too slow…and that I should be measuring the process, not the loading of a csv file…while I don’t agree…because everything is included in the process…I did something as simple as moving the start.time to after the csv file is loaded…let’s see how much of a change this brings…

SparkR

Around 1 second faster…which means that reading the csv was really efficient…

Plain R

Around 6 seconds faster…read.csv is not that good…but…SparkR is almost 50% faster…
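The split between loading and processing that this update is about can also be measured in one run instead of moving start.time around. This is just a sketch using base R's `system.time()`; the tiny temporary file is a stand-in for Doc_Header.csv, and its contents are made up.

```r
# Stand-in data: a small temporary file instead of the real Doc_Header.csv
path <- tempfile(fileext = ".csv")
writeLines(c("alpha,1", "beta,2", "gamma,3", "xyz,4"), path)

# Time the loading step on its own
load_time <- system.time(logfile <- read.csv(path, header = FALSE))

# Time the counting step on its own, same approach as in PlainR.R
count_time <- system.time({
  rows <- apply(logfile, 1, function(x) paste(x, collapse = " "))
  a <- sum(grepl("a", rows))
  b <- sum(grepl("b", rows))
})

load_time[["elapsed"]]   # seconds spent reading the file
count_time[["elapsed"]]  # seconds spent counting
paste("Lines with a: ", a, ", Lines with b: ", b, sep = "")
```

Reporting both numbers side by side makes it easy to see which step dominates, whichever one you think should count as "the process".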

HOLY CRAP UPDATE!

Markus from Spain gave me this code in the comments…I just added a couple of things to make it compliant…but…damn…I wish I could code like that in R! -:D Thanks Markus!!!

Markus’s code

logFile <- "/home/blag/R_Codes/Doc_Header.csv"
lines <- readLines(logFile)
start.time <- Sys.time()
a<-sum(grepl("a", lines, fixed=TRUE))
b<-sum(grepl("b", lines, fixed=TRUE))
paste("Lines with a: ", a, ", Lines with b: ", b, sep="")
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

Simply…superb! -:)
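My reading of why Markus's version is so much leaner: `grepl()` is vectorized, so one call handles every line at once, and `fixed=TRUE` matches the pattern as a literal string instead of a regular expression. Here is a small sketch comparing that with a one-call-per-line style like the one in PlainR.R. The sample lines are made up, and no timings are claimed.

```r
# Made-up sample lines, standing in for the contents of a real file
lines <- c("alpha beta", "gamma", "xyz", "bob", "delta")

# One grepl() call per line, similar in spirit to apply() in PlainR.R
a_per_line <- sum(vapply(lines, function(x) grepl("a", x), logical(1)))
b_per_line <- sum(vapply(lines, function(x) grepl("b", x), logical(1)))

# A single vectorized grepl() call over all lines, as in Markus's code
a_vec <- sum(grepl("a", lines, fixed = TRUE))
b_vec <- sum(grepl("b", lines, fixed = TRUE))

# Both styles should agree on the counts
c(a_per_line == a_vec, b_per_line == b_vec)
```

For a single-letter pattern the literal match and the regex match give the same answer, so the only difference between the two styles is how much work R does per line.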

Greetings,

Blag.

Development Culture.
