This post is a continuation of my earlier post Big Data-1: Move into the big league: Graduate from Python to Pyspark. While the earlier post discussed parallel constructs in Python and Pyspark, this post elaborates similar key constructs in R and SparkR. While this post focuses on the programming side of R and SparkR, it is essential to understand and fully grasp the concepts of Spark, RDDs and how data is distributed across the cluster. Like the earlier post, this one shows that if you already have a good handle on R, you can easily graduate to Big Data with SparkR.
Note 1: This notebook has also been published at the Databricks community site as Big Data-2: Move into the big league: Graduate from R to SparkR
Note 2: To upload the CSV to Databricks, see the video Upload Flat File to Databricks Table
# Read the CSV file
tendulkar <- read.csv("/dbfs/FileStore/tables/tendulkar.csv",
                      stringsAsFactors = FALSE, na.strings = c(NA, "-"))
# Check the dimensions of the dataframe
dim(tendulkar)
[1] 347 12
# Load the SparkR library
library(SparkR)
# Initiate a SparkR session
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv",
                      header = "true",
                      delimiter = ",",
                      source = "csv",
                      inferSchema = "true",
                      na.strings = "")
# Check the dimensions of the dataframe
dim(tendulkar1)
[1] 347 12
# Get the shape of the dataframe in R
dim(tendulkar)
[1] 347 12
The same ‘dim’ command works in SparkR too!
dim(tendulkar1)
[1] 347 12
# Get the names
names(tendulkar)
# Also
colnames(tendulkar)
 [1] "Runs"       "Mins"       "BF"         "X4s"        "X6s"
 [6] "SR"         "Pos"        "Dismissal"  "Inns"       "Opposition"
[11] "Ground"     "Start.Date"
names(tendulkar1)
 [1] "Runs"       "Mins"       "BF"         "4s"         "6s"
 [6] "SR"         "Pos"        "Dismissal"  "Inns"       "Opposition"
[11] "Ground"     "Start Date"
names(tendulkar) <- c('Runs','Minutes','BallsFaced','Fours','Sixes','StrikeRate',
                      'Position','Dismissal','Innings','Opposition','Ground','StartDate')
names(tendulkar)
 [1] "Runs"       "Minutes"    "BallsFaced" "Fours"      "Sixes"
 [6] "StrikeRate" "Position"   "Dismissal"  "Innings"    "Opposition"
[11] "Ground"     "StartDate"
names(tendulkar1) <- c('Runs','Minutes','BallsFaced','Fours','Sixes','StrikeRate',
                       'Position','Dismissal','Innings','Opposition','Ground','StartDate')
names(tendulkar1)
 [1] "Runs"       "Minutes"    "BallsFaced" "Fours"      "Sixes"
 [6] "StrikeRate" "Position"   "Dismissal"  "Innings"    "Opposition"
[11] "Ground"     "StartDate"
summary(tendulkar)
     Runs              Minutes         BallsFaced         Fours
 Length:347         Min.   :  1.0   Min.   :  0.00   Min.   : 0.000
 Class :character   1st Qu.: 33.0   1st Qu.: 22.00   1st Qu.: 1.000
 Mode  :character   Median : 82.0   Median : 58.50   Median : 4.000
                    Mean   :125.5   Mean   : 89.75   Mean   : 6.274
                    3rd Qu.:181.0   3rd Qu.:133.25   3rd Qu.: 9.000
                    Max.   :613.0   Max.   :436.00   Max.   :35.000
                    NA's   :18      NA's   :19       NA's   :19
     Sixes          StrikeRate        Position     Dismissal
 Min.   :0.0000   Min.   :  0.00   Min.   :2.00   Length:347
 1st Qu.:0.0000   1st Qu.: 38.09   1st Qu.:4.00   Class :character
 Median :0.0000   Median : 52.25   Median :4.00   Mode  :character
 Mean   :0.2097   Mean   : 51.79   Mean   :4.24
 3rd Qu.:0.0000   3rd Qu.: 65.09   3rd Qu.:4.00
 Max.   :4.0000   Max.   :166.66   Max.   :7.00
 NA's   :18       NA's   :20       NA's   :18
    Innings       Opposition           Ground           StartDate
 Min.   :1.000   Length:347         Length:347         Length:347
 1st Qu.:1.000   Class :character   Class :character   Class :character
 Median :2.000   Mode  :character   Mode  :character   Mode  :character
 Mean   :2.376
 3rd Qu.:3.000
 Max.   :4.000
 NA's   :1
summary(tendulkar1)
SparkDataFrame[summary:string, Runs:string, Minutes:string, BallsFaced:string, Fours:string, Sixes:string, StrikeRate:string, Position:string, Dismissal:string, Innings:string, Opposition:string, Ground:string, StartDate:string]
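Note that SparkR's summary() does not print statistics the way base R's does: it lazily returns another SparkDataFrame, which is why only the schema appears above. To actually see the values, the result has to be brought back to the driver. A minimal sketch, assuming the SparkR session and tendulkar1 above are still active:

```r
# summary() on a SparkDataFrame is lazy; collect() materialises the
# computed statistics as a local R data.frame
stats <- SparkR::summary(tendulkar1)
head(SparkR::collect(stats))
```

This lazy behaviour is a recurring theme in SparkR: most operations return a new SparkDataFrame, and nothing is computed until an action such as collect(), head() or count() forces it.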
str(tendulkar)
'data.frame':	347 obs. of  12 variables:
 $ Runs      : chr  "15" "DNB" "59" "8" ...
 $ Minutes   : int  28 NA 254 24 124 74 193 1 50 324 ...
 $ BallsFaced: int  24 NA 172 16 90 51 134 1 44 266 ...
 $ Fours     : int  2 NA 4 1 5 5 6 0 3 5 ...
 $ Sixes     : int  0 NA 0 0 0 0 0 0 0 0 ...
 $ StrikeRate: num  62.5 NA 34.3 50 45.5 ...
 $ Position  : int  6 NA 6 6 7 6 6 6 6 6 ...
 $ Dismissal : chr  "bowled" NA "lbw" "run out" ...
 $ Innings   : int  2 4 1 3 1 1 3 2 3 1 ...
 $ Opposition: chr  "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" ...
 $ Ground    : chr  "Karachi" "Karachi" "Faisalabad" "Faisalabad" ...
 $ StartDate : chr  "15-Nov-89" "15-Nov-89" "23-Nov-89" "23-Nov-89" ...
str(tendulkar1)
'SparkDataFrame': 12 variables:
 $ Runs      : chr "15" "DNB" "59" "8" "41" "35"
 $ Minutes   : chr "28" "-" "254" "24" "124" "74"
 $ BallsFaced: chr "24" "-" "172" "16" "90" "51"
 $ Fours     : chr "2" "-" "4" "1" "5" "5"
 $ Sixes     : chr "0" "-" "0" "0" "0" "0"
 $ StrikeRate: chr "62.5" "-" "34.3" "50" "45.55" "68.62"
 $ Position  : chr "6" "-" "6" "6" "7" "6"
 $ Dismissal : chr "bowled" "-" "lbw" "run out" "bowled" "lbw"
 $ Innings   : chr "2" "4" "1" "3" "1" "1"
 $ Opposition: chr "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan" "v Pakistan"
 $ Ground    : chr "Karachi" "Karachi" "Faisalabad" "Faisalabad" "Lahore" "Sialkot"
 $ StartDate : chr "15-Nov-89" "15-Nov-89" "23-Nov-89" "23-Nov-89" "1-Dec-89" "9-Dec-89"
print(head(tendulkar), 3)
print(tail(tendulkar), 3)
  Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1   15      28         24     2     0      62.50        6    bowled       2
2  DNB      NA         NA    NA    NA         NA       NA      <NA>       4
3   59     254        172     4     0      34.30        6       lbw       1
4    8      24         16     1     0      50.00        6   run out       3
5   41     124         90     5     0      45.55        7    bowled       1
6   35      74         51     5     0      68.62        6       lbw       1
  Opposition     Ground StartDate
1 v Pakistan    Karachi 15-Nov-89
2 v Pakistan    Karachi 15-Nov-89
3 v Pakistan Faisalabad 23-Nov-89
4 v Pakistan Faisalabad 23-Nov-89
5 v Pakistan     Lahore  1-Dec-89
6 v Pakistan    Sialkot  9-Dec-89
    Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
342   37     125         81     5     0      45.67        4    caught       2
343   21      71         23     2     0      91.30        4   run out       4
344   32      99         53     5     0      60.37        4       lbw       2
345    1       8          5     0     0      20.00        4       lbw       4
346   10      41         24     2     0      41.66        4       lbw       2
347   74     150        118    12     0      62.71        4    caught       2
      Opposition  Ground StartDate
342  v Australia  Mohali 14-Mar-13
343  v Australia  Mohali 14-Mar-13
344  v Australia   Delhi 22-Mar-13
345  v Australia   Delhi 22-Mar-13
346 v West Indies Kolkata  6-Nov-13
347 v West Indies  Mumbai 14-Nov-13
head(tendulkar1,3)
  Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1   15      28         24     2     0       62.5        6    bowled       2
2  DNB       -          -     -     -          -        -         -       4
3   59     254        172     4     0       34.3        6       lbw       1
  Opposition     Ground StartDate
1 v Pakistan    Karachi 15-Nov-89
2 v Pakistan    Karachi 15-Nov-89
3 v Pakistan Faisalabad 23-Nov-89
sapply(tendulkar,class)
       Runs     Minutes  BallsFaced       Fours       Sixes  StrikeRate
"character"   "integer"   "integer"   "integer"   "integer"   "numeric"
   Position   Dismissal     Innings  Opposition      Ground   StartDate
  "integer" "character"   "integer" "character" "character" "character"
printSchema(tendulkar1)
root
 |-- Runs: string (nullable = true)
 |-- Minutes: string (nullable = true)
 |-- BallsFaced: string (nullable = true)
 |-- Fours: string (nullable = true)
 |-- Sixes: string (nullable = true)
 |-- StrikeRate: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Dismissal: string (nullable = true)
 |-- Innings: string (nullable = true)
 |-- Opposition: string (nullable = true)
 |-- Ground: string (nullable = true)
 |-- StartDate: string (nullable = true)
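Since the '-' placeholders defeat schema inference, every column comes in as string. Numeric columns therefore have to be cast explicitly before any arithmetic can be done on them. A minimal sketch, assuming the session and tendulkar1 above are still active:

```r
# Cast the Minutes column from string to integer; withColumn returns a
# new SparkDataFrame, so the result must be assigned back
tendulkar1 <- SparkR::withColumn(tendulkar1, "Minutes",
                                 SparkR::cast(tendulkar1$Minutes, "integer"))
printSchema(tendulkar1)
```

The same pattern (regexp cleaning followed by a cast) is applied to the Runs column in the grouping example further below.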
library(dplyr)
df <- select(tendulkar, Runs, BallsFaced, Minutes)
head(df, 5)
  Runs BallsFaced Minutes
1   15         24      28
2  DNB         NA      NA
3   59        172     254
4    8         16      24
5   41         90     124
library(SparkR)
Sys.setenv(SPARK_HOME="/usr/hdp/2.6.0.3-8/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# Initiate a SparkR session
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv",
                      header = "true",
                      delimiter = ",",
                      source = "csv",
                      inferSchema = "true",
                      na.strings = "")
# Use SparkR::select explicitly, since dplyr also exports a select()
df <- SparkR::select(tendulkar1, "Runs", "BF", "Mins")
head(SparkR::collect(df))
  Runs  BF Mins
1   15  24   28
2  DNB   -    -
3   59 172  254
4    8  16   24
5   41  90  124
6   35  51   74
library(dplyr)
df <- tendulkar %>% filter(Runs > 50)
head(df, 5)
  Runs Minutes BallsFaced Fours Sixes StrikeRate Position Dismissal Innings
1  DNB      NA         NA    NA    NA         NA       NA      <NA>       4
2   59     254        172     4     0      34.30        6       lbw       1
3    8      24         16     1     0      50.00        6   run out       3
4   57     193        134     6     0      42.53        6    caught       3
5   88     324        266     5     0      33.08        6    caught       1
     Opposition     Ground StartDate
1    v Pakistan    Karachi 15-Nov-89
2    v Pakistan Faisalabad 23-Nov-89
3    v Pakistan Faisalabad 23-Nov-89
4    v Pakistan    Sialkot  9-Dec-89
5 v New Zealand     Napier  9-Feb-90
Note that Runs is still a character column at this point, so the comparison Runs > 50 is lexicographic; that is why rows such as "DNB" and "8" slip through. The column is cleaned and converted to numeric in the grouping example further below.
df <- SparkR::filter(tendulkar1, tendulkar1$Runs > 50)
head(SparkR::collect(df))
  Runs Mins  BF 4s 6s    SR Pos Dismissal Inns     Opposition       Ground
1   59  254 172  4  0  34.3   6       lbw    1     v Pakistan   Faisalabad
2   57  193 134  6  0 42.53   6    caught    3     v Pakistan      Sialkot
3   88  324 266  5  0 33.08   6    caught    1  v New Zealand       Napier
4   68  216 136  8  0    50   6    caught    2      v England   Manchester
5  114  228 161 16  0  70.8   4    caught    2    v Australia        Perth
6  111  373 270 19  0 41.11   4    caught    2 v South Africa Johannesburg
  Start Date
1  23-Nov-89
2   9-Dec-89
3   9-Feb-90
4   9-Aug-90
5   1-Feb-92
6  26-Nov-92
unique(tendulkar$Runs)
  [1] "15"   "DNB"  "59"   "8"    "41"   "35"   "57"   "0"    "24"   "88"
 [11] "5"    "10"   "27"   "68"   "119*" "21"   "11"   "16"   "7"    "40"
 [21] "148*" "6"    "17"   "114"  "111"  "1"    "73"   "50"   "9*"   "165"
 [31] "78"   "62"   "TDNB" "28"   "104*" "71"   "142"  "96"   "43"   "11*"
 [41] "34"   "85"   "179"  "54"   "4"    "0*"   "52*"  "2"    "122"  "31"
 [51] "177"  "74"   "42"   "18"   "61"   "36"   "169"  "9"    "15*"  "92"
 [61] "83"   "143"  "139"  "23"   "148"  "13"   "155*" "79"   "47"   "113"
 [71] "67"   "136"  "29"   "53"   "124*" "126*" "44*"  "217"  "116"  "52"
 [81] "45"   "97"   "20"   "39"   "201*" "76"   "65"   "126"  "36*"  "69"
 [91] "155"  "22*"  "103"  "26"   "90"   "176"  "117"  "86"   "12"   "193"
[101] "16*"  "51"   "32"   "55"   "37"   "44"   "241*" "60*"  "194*" "3"
[111] "32*"  "248*" "94"   "22"   "109"  "19"   "14"   "28*"  "63"   "64"
[121] "101"  "122*" "91"   "82"   "56*"  "154*" "153"  "49"   "10*"  "103*"
[131] "160"  "100*" "105*" "100"  "106"  "84"   "203"  "98"   "38"   "214"
[141] "53*"  "111*" "146"  "14*"  "56"   "80"   "25"   "81"   "13*"
head(SparkR::distinct(tendulkar1[,"Runs"]),5)
  Runs
1 119*
2    7
3   51
4  169
5  32*
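As with the other SparkR operations, distinct() returns a lazy SparkDataFrame and head() above pulls back only the first few rows. To bring every distinct value to the driver as a regular R data.frame, mirroring base R's unique(), collect() can be used. A minimal sketch, assuming tendulkar1 as read above:

```r
# Collect all distinct Runs values locally as an R data.frame
runsDF <- SparkR::collect(SparkR::distinct(SparkR::select(tendulkar1, "Runs")))
nrow(runsDF)
```

Be careful with collect() on large datasets: it ships the entire result to the driver, so it should only follow an aggregation or a projection of modest size.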
library(dplyr)
library(magrittr)
# Remove rows with 'DNB'
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)
# Remove rows with 'TDNB'
c <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[c,]
# Remove rows with 'absent'
d <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[d,]
dim(tendulkar)
# Remove the '*' indicating not out and convert Runs to numeric
tendulkar$Runs <- as.numeric(gsub("\\*", "", tendulkar$Runs))
# Subset the rows which are complete
c <- complete.cases(tendulkar)
tendulkar <- tendulkar[c,]
print(dim(tendulkar))
df <- tendulkar %>%
  group_by(Ground) %>%
  summarise(meanRuns = mean(Runs), minRuns = min(Runs), maxRuns = max(Runs))
head(df)
[1] 327  12
# A tibble: 6 x 4
  Ground     meanRuns minRuns maxRuns
  <chr>         <dbl>   <dbl>   <dbl>
1 Adelaide       32.6      0.    153.
2 Ahmedabad      40.1      4.    217.
3 Auckland        5.00     5.      5.
4 Bangalore      57.9      4.    214.
5 Birmingham     46.8      1.    122.
6 Bloemein       85.0     15.    155.
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv",
                      header = "true",
                      delimiter = ",",
                      source = "csv",
                      inferSchema = "true",
                      na.strings = "")
print(dim(tendulkar1))
# Remove rows with 'DNB'
tendulkar1 <- SparkR::filter(tendulkar1, tendulkar1$Runs != "DNB")
print(dim(tendulkar1))
# Remove rows with 'TDNB'
tendulkar1 <- SparkR::filter(tendulkar1, tendulkar1$Runs != "TDNB")
print(dim(tendulkar1))
# Remove rows with 'absent'
tendulkar1 <- SparkR::filter(tendulkar1, tendulkar1$Runs != "absent")
print(dim(tendulkar1))
head(SparkR::distinct(tendulkar1[,"Runs"]), 20)
# Remove the '*' indicating not out
tendulkar1$Runs <- SparkR::regexp_replace(tendulkar1$Runs, "\\*", "")
head(SparkR::distinct(tendulkar1[,"Runs"]), 20)
# Cast the string Runs to double; note that withColumn returns a new
# SparkDataFrame, so the result must be assigned back
tendulkar1 <- SparkR::withColumn(tendulkar1, "Runs",
                                 SparkR::cast(tendulkar1$Runs, "double"))
df <- SparkR::summarize(SparkR::groupBy(tendulkar1, tendulkar1$Ground),
                        mean = mean(tendulkar1$Runs),
                        minRuns = min(tendulkar1$Runs),
                        maxRuns = max(tendulkar1$Runs))
head(df, 20)
[1] 347  12
[1] 330  12
[1] 329  12
[1] 329  12
          Ground       mean minRuns maxRuns
1      Bangalore  54.312500       0      96
2       Adelaide  32.600000       0      61
3  Colombo (PSS)  37.200000      14      71
4   Christchurch  12.000000       0      24
5       Auckland   5.000000       5       5
6        Chennai  60.625000       0      81
7      Centurion  73.500000     111      36
8       Brisbane   7.666667       0       7
9     Birmingham  46.750000       1      40
10     Ahmedabad  40.125000     100       8
11 Colombo (RPS) 143.000000     143     143
12    Chittagong  57.800000     101      36
13     Cape Town  69.857143      14       9
14    Bridgetown  26.000000       0      92
15      Bulawayo  55.000000      36      74
16         Delhi  39.947368       0      76
17    Chandigarh  11.000000      11      11
18      Bloemein  85.000000      15     155
19 Colombo (SSC)  77.555556     104       8
20       Cuttack   2.000000       2       2
sparkR.session()
tendulkar1 <- read.df("/FileStore/tables/tendulkar.csv",
                      header = "true",
                      delimiter = ",",
                      source = "csv",
                      inferSchema = "true",
                      na.strings = "")
# Register this SparkDataFrame as a temporary view
createOrReplaceTempView(tendulkar1, "tendulkar2")
# SQL statements can be run by using the sql method
df <- SparkR::sql("SELECT * FROM tendulkar2 WHERE Ground='Karachi'")
head(df)
  Runs Mins BF 4s 6s    SR Pos Dismissal Inns Opposition  Ground Start Date
1   15   28 24  2  0  62.5   6    bowled    2 v Pakistan Karachi  15-Nov-89
2  DNB    -  -  -  -     -   -         -    4 v Pakistan Karachi  15-Nov-89
3   23   49 29  5  0 79.31   4    bowled    2 v Pakistan Karachi  29-Jan-06
4   26   74 47  5  0 55.31   4    bowled    4 v Pakistan Karachi  29-Jan-06
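Once a temporary view is registered, aggregations can also be pushed into SQL itself rather than expressed through groupBy/summarize. A minimal sketch, assuming the view tendulkar2 registered above is still available in the session:

```r
# Aggregate with Spark SQL: count innings per ground using the registered view
df <- SparkR::sql("SELECT Ground, COUNT(*) AS innings
                   FROM tendulkar2
                   GROUP BY Ground
                   ORDER BY innings DESC")
head(df)
```

Whether you use the SparkR DataFrame API or SQL is largely a matter of taste; both are compiled into the same Spark execution plan.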
This post discussed some of the key constructs in R and SparkR and showed how one can transition from R to SparkR fairly easily. I will be adding more constructs later. Do check back!
You may also like
1. Exploring Quantum Gate operations with QCSimulator
2. Deep Learning from first principles in Python, R and Octave – Part 4
3. A Bluemix recipe with MongoDB and Node.js
4. Practical Machine Learning with R and Python – Part 5
5. Introducing cricketr! : An R package to analyze performances of cricketers
To see all posts click Index of posts