Advent of 2021, Day 10 – Working with data frames
Series of Apache Spark posts:
- Dec 01: What is Apache Spark
- Dec 02: Installing Apache Spark
- Dec 03: Getting around CLI and WEB UI in Apache Spark
- Dec 04: Spark Architecture – Local and cluster mode
- Dec 05: Setting up Spark Cluster
- Dec 06: Setting up IDE
- Dec 07: Starting Spark with R and Python
- Dec 08: Creating RDD files
- Dec 09: RDD Operations
We have looked at datasets and seen that a dataset is a distributed collection of data. A dataset can be constructed from JVM objects and later manipulated with transformation operations (e.g. filter(), map(), …). The Dataset API is available in Scala and Java; Python and R do not have a dedicated Dataset API, but you can still access the columns and rows of a dataset from both languages.
On the other hand, a dataframe is a dataset organised into named columns. It offers much better optimisation and computation and still resembles a typical table (as we know it from the database world). Dataframes can be constructed from arrays and matrices, from a variety of files, from SQL tables, and from existing datasets (RDDs). The Dataframe API is available in all flavours: Java, Scala, R and Python, hence its popularity.
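As a quick illustration of the last point, here is a minimal sketch (assuming a SparkR session is already running, as set up in the next section; the sample columns and values are made up) showing how a distributed dataframe can be created from local R data and collected back again:

library(SparkR)

# a small local data.frame, just for illustration
local_people <- data.frame(name = c("Michael", "Andy", "Justin"),
                           age  = c(29L, 30L, 19L))

# local data -> distributed Spark dataframe with named columns
sdf <- createDataFrame(local_people)
printSchema(sdf)

# distributed dataframe -> back to a local R data.frame
head(collect(sdf))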
Dataframes with R
Start a session and get going:
library(SparkR)

# spark_home should point to your Spark installation directory
spark_path <- file.path(spark_home, "bin", "spark-class")

# Start cluster manager master node
system2(spark_path, "org.apache.spark.deploy.master.Master", wait = FALSE)

# Start worker node, find master URL at http://localhost:8080/
system2(spark_path, c("org.apache.spark.deploy.worker.Worker", "spark://192.168.0.184:7077"), wait = FALSE)

# connect the SparkR session to the standalone master started above
sparkR.session(appName = "R Dataframe Session",
               master = "spark://192.168.0.184:7077")
And start working with a dataframe by importing a short and simple JSON file (copy this and store it in a people.json file):
{"name":"Michael", "age":29, "height":188} {"name":"Andy", "age":30, "height":201} {"name":"Justin", "age":19, "height":175} df <- read.json("usr/library/main/resources/people.json") head(df)
And we can run several transformations:
# select name and increment age by one
head(select(df, df$name, df$age + 1))

# filter people older than 21
head(where(df, df$age > 21))

# Count people by age
head(count(groupBy(df, "age")))
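Aggregations, sorting and plain SQL queries are also available. The sketch below assumes the df dataframe from above; the avg_height column name and the people view name are only illustrative:

# average height per age, sorted by age
head(arrange(agg(groupBy(df, "age"), avg_height = avg(df$height)), "age"))

# the same dataframe can also be queried with SQL through a temporary view
createOrReplaceTempView(df, "people")
adults <- sql("SELECT name, age FROM people WHERE age > 21")
head(adults)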
We can also add and combine additional packages (e.g. dplyr, dbplot, ggplot2):
# histogram with the dbplot package
library(dbplot)

# collect the Spark dataframe into a local R data.frame for plotting
df_new <- collect(df)

dbplot_histogram(df_new, height)

# adding ggplot
library(ggplot2)
library(tidyverse)

df_new %>%
  gather(key, value, age, height) %>%
  ggplot(aes(key, value, fill = key)) +
  geom_boxplot()
There are many other functions that can be used with the Spark Dataframe API in R. Alternatively, we can do the same with Python.
Dataframes with Python
Start a session and get going:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Dataframe Session API") \
    .master("spark://192.168.0.184:7077") \
    .getOrCreate()
And we can start by importing data into a dataframe.
df = spark.read.json("examples/src/main/resources/people.json")
df.head()

# or show the complete dataset
df.show()
And we can work with filters and subset the dataframe, as if it were a normal numpy/pandas dataframe:
df.select(df['name'], df['age'] + 1).show()

# filtering by age
df.filter(df['age'] > 21).show()

# grouping by age and displaying the count
df.groupBy("age").count().show()
Tomorrow we will look at how to plug the R or Python dataframe into additional packages and get more out of the data.
Complete set of code, documents, notebooks, and all of the materials will be available at the Github repository: https://github.com/tomaztk/Spark-for-data-engineers
Happy Spark Advent of 2021!