Series of Apache Spark posts:
- Dec 01: What is Apache Spark
- Dec 02: Installing Apache Spark
- Dec 03: Getting around CLI and WEB UI in Apache Spark
- Dec 04: Spark Architecture – Local and cluster mode
- Dec 05: Setting up Spark Cluster
- Dec 06: Setting up IDE
- Dec 07: Starting Spark with R and Python
- Dec 08: Creating RDD files
- Dec 09: RDD Operations
- Dec 10: Working with data frames
Once you install Spark, the range of extensions is huge: not only additional languages, but also other packages and systems. With R, for example, you can not only harness the capabilities of distributed and parallel computation, but also extend the reach of the R language itself.
R
A variety of extensions is available from the CRAN repository or from GitHub: Spark with flint, Spark with Avro, Spark with EMR and many more. For data analysis and machine learning you can take, for example, sparktf (for TensorFlow), xgboost (compatible with Spark), geospark for working with geospatial data, Spark for R on Google Cloud, and many more. A simple way to start is to install an extension:
library(sparkextension)
library(sparklyr)

sc <- spark_connect(master = "spark://192.168.0.184:7077")
Set the connection to the master and all the additional packages are installed on the Spark master as well.
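Once connected, a quick sanity check is to push a small local dataset to the cluster and run a dplyr aggregation on it. A minimal sketch, assuming the sc connection from the snippet above (copy_to and the dplyr translation are standard sparklyr functionality):

library(dplyr)

# copy the local iris dataset to the cluster (sparklyr renames
# columns such as Petal.Length to Petal_Length)
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# the pipeline is translated to Spark SQL and runs on the cluster;
# collect() brings only the small aggregated result back to R
iris_tbl %>%
  group_by(Species) %>%
  summarise(n = n(), avg_petal_length = mean(Petal_Length, na.rm = TRUE)) %>%
  collect()

The heavy lifting happens on the cluster; only the summarised rows travel back to the R session.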
Furthermore, the rsparkling extension gives you even more capabilities and enables you to use H2O in Spark with R. Download and install H2O and rsparkling from the cloud:
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-yates/5/R") install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.3/31/R")
And working with H2O requires little more than loading the libraries, connecting, and converting a Spark DataFrame to an H2O frame:
library(rsparkling)
library(sparklyr)
library(h2o)

sc <- spark_connect(master = "local", version = "2.3",
                    config = list(sparklyr.connect.timeout = 120))

# getting data
iris_spark <- copy_to(sc, iris)

# converting to h2o on spark dataframe
iris_spark_h2o <- as_h2o_frame(sc, iris_spark)
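From this point the regular H2O toolbox applies to the frame. A minimal sketch, assuming the iris_spark_h2o frame from the snippet above, could split the data and train a gradient boosting model:

# split the H2O frame into training and test sets
splits <- h2o.splitFrame(iris_spark_h2o, ratios = 0.8, seed = 42)

# Species is categorical, so h2o.gbm fits a classification model
model <- h2o.gbm(x = 1:4, y = "Species", training_frame = splits[[1]])

# standard classification metrics on the held-out split
h2o.performance(model, newdata = splits[[2]])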
Python
With Python, the extensibility is just as rich as with R, and there are even more packages available. Python also has an extension called PySparkling, which brings the H2O Python packages to Spark:
pip install h2o_pysparkling_2.2
pip install requests
pip install tabulate
pip install future
And starting an H2O context on the cluster:
from pysparkling import *
import h2o

hc = H2OContext.getOrCreate()
And moving data between the two worlds, here importing a file into H2O and converting the H2O frame to a Spark DataFrame:
import h2o

# import a sample dataset directly into H2O
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")

# convert the H2O frame to a Spark DataFrame and fix the label type
sparkDF = hc.asSparkFrame(frame)
sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string"))

# split into training and testing sets
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
And you can start working with anything relating to DataFrames, machine learning and more; for example, setting up H2O AutoML:
from pysparkling.ml import H2OAutoML

automl = H2OAutoML(labelCol="CAPSULE", ignoredCols=["ID"])
Tomorrow we will look at Spark SQL and how to get on board.
The complete set of code, documents, notebooks, and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers
Happy Spark Advent of 2021!