Advent of 2021, Day 12 – Spark SQL
Series of Apache Spark posts:
- Dec 01: What is Apache Spark
- Dec 02: Installing Apache Spark
- Dec 03: Getting around CLI and WEB UI in Apache Spark
- Dec 04: Spark Architecture – Local and cluster mode
- Dec 05: Setting up Spark Cluster
- Dec 06: Setting up IDE
- Dec 07: Starting Spark with R and Python
- Dec 08: Creating RDD files
- Dec 09: RDD Operations
- Dec 10: Working with data frames
- Dec 11: Working with packages and spark DataFrames
Spark SQL is one of the Spark modules for structured data processing and analysis. Besides the module itself, Spark provides an API for executing SQL queries. Spark SQL can read data from a Hive instance, but also from Datasets and DataFrames, and the communication between Spark SQL and the execution engine will always return a Dataset or DataFrame.
These formats are interchangeable, so you can run SQL against the result of a different API, and vice versa. Plugging in the Java JDBC or standard ODBC drivers will also give your SQL interface access to different sources. This unification means that developers can easily switch back and forth between APIs, depending on which provides the most natural way to express a given transformation.
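To illustrate this interchangeability, here is a minimal PySpark sketch (run e.g. in the pyspark shell, where the spark session object already exists; the example data simply mirrors the users file used below). A DataFrame built with the DataFrame API is registered as a temporary view, queried with SQL, and the result comes back as a DataFrame again:

# Build a small DataFrame with the DataFrame API (example data)
df = spark.createDataFrame(
    [("Alyssa", None), ("Ben", "red")],
    ["name", "favorite_color"])

# Expose it to SQL as a temporary view ...
df.createOrReplaceTempView("users")

# ... and query it with plain SQL; the result is again a DataFrame
sql_df = spark.sql("SELECT name FROM users WHERE favorite_color = 'red'")
sql_df.show()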
With API unification, the user can access Spark SQL using the Scala spark-shell, the Python pyspark shell or the R sparkR shell.
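In these interactive shells the session object (spark) is created for you at startup. In a standalone PySpark script you create it yourself, roughly like this (a small sketch; the application name is arbitrary):

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; in the shells this object already exists
spark = SparkSession.builder \
    .appName("advent-2021-day12") \
    .getOrCreate()

print(spark.version)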
Loading data – comparison of SQL, R and Python
Continuing from the previous blog post, we will use the Parquet file (more info on this file format: here) to compare importing / loading the data. The raw content of the file looks like this (the file is also directly available here):
PAR1"&,@AlyssaBen, 0red88, @ \Hexample.avro.User%name%%favorite_color%5favorite_numbers%array<&%nameDH&&P5favorite_color<@&P&ê%(favorite_numbersarray ZZ&ê⁄avro.schema⁄{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_color","type":["string","null"]},{"name":"favorite_numbers","type":{"type":"array","items":"int"}}]}parquet-mr version 1.4.3ÍPAR1
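The raw Parquet binary above is hard to read; a quick way to see the schema embedded in it is to load the file and print it (a small Python sketch, assuming the same example path as in the code below; the schema shown in comments follows the avro schema stored in the file):

# Read the example Parquet file shipped with Spark and inspect its schema
df = spark.read.parquet("examples/src/main/resources/users.parquet")
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- favorite_color: string (nullable = true)
#  |-- favorite_numbers: array (nullable = true)
#  |    |-- element: integer (containsNull = true)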
For Python:
# Load the example users.parquet file
df = spark.read.parquet("examples/src/main/resources/users.parquet")

# Write the data back out as Parquet with a bloom filter on favorite_color,
# dictionary encoding enabled and page write checksums disabled
(df.write.format("parquet")
    .option("parquet.bloom.filter.enabled#favorite_color", "true")
    .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
    .option("parquet.enable.dictionary", "true")
    .option("parquet.page.write-checksum.enabled", "false")
    .save("users_with_options.parquet"))
For R:
# Load the example users.parquet file
df <- read.df("examples/src/main/resources/users.parquet", "parquet")

# Write the data back out as Parquet with a bloom filter on favorite_color;
# option names containing '#' or '-' need backticks in R
write.parquet(df, "users_with_options.parquet",
              `parquet.bloom.filter.enabled#favorite_color` = "true",
              `parquet.bloom.filter.expected.ndv#favorite_color` = "1000000",
              parquet.enable.dictionary = "true",
              `parquet.page.write-checksum.enabled` = "false")
And using SQL:
CREATE TABLE users_with_options (
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING parquet
OPTIONS (
  `parquet.bloom.filter.enabled#favorite_color` true,
  `parquet.bloom.filter.expected.ndv#favorite_color` 1000000,
  parquet.enable.dictionary true,
  parquet.page.write-checksum.enabled true
)
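These options only change how the Parquet files are written (bloom filters, dictionary encoding, checksums), not the data itself, which you can verify by reading the result back (a quick sketch in Python, using the path from above):

# Read back the Parquet data written with the extra options
check_df = spark.read.parquet("users_with_options.parquet")
check_df.show()  # the same users as in the original file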
The same file can also be queried directly in its Parquet format, without persisting the content as a table.
For Python:
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
And for R:
df <- sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
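The result of such a query is again a DataFrame, so you can filter or aggregate right away; for example (a Python sketch, the WHERE condition is arbitrary):

# Query the Parquet file directly and keep only users with a favourite colour
df = spark.sql("""
  SELECT name, favorite_color
  FROM parquet.`examples/src/main/resources/users.parquet`
  WHERE favorite_color IS NOT NULL
""")
df.show()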
Tomorrow we will look further into Spark SQL, bucketing and partitioning.
The complete set of code, documents, notebooks, and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers
Happy Spark Advent of 2021!