Advent of 2021, Day 2 – Installing Apache Spark

tomaztsql

4 months ago

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers].

Series of Apache Spark posts:

Dec 01: What is Apache Spark

Today, we will look into installing Apache Spark. Spark is cross-platform software, so we will cover the installation on both Windows and macOS.

Windows

Installing Apache Spark on a Windows computer requires a preinstalled Java JDK (Java Development Kit), version 8 or later; the current version at the time of writing is 17. Download the JDK from the Oracle website and install it on your system. The easiest way is to download the x64 MSI Installer, run it, and follow the instructions. The installer will create a folder like “C:\Program Files\Java\jdk-17.0.1”. Note that the Spark 3.2 documentation lists Java 8 and 11 as the supported versions, so if you run into problems with JDK 17, installing JDK 11 is a safe fallback.

After the installation is completed, proceed with the installation of Apache Spark. Download Spark from the Apache Spark website. Select Spark release 3.2.0 (Oct 13 2021) with package type: Pre-built for Apache Hadoop 3.3 and later.

Click on the file “spark-3.2.0-bin-hadoop3.2.tgz” and it will redirect you to the download site.

Create a folder (I am creating C:\SparkApp) and extract the contents of the tgz file into it. The final folder structure should be: C:\SparkApp\spark-3.2.0-bin-hadoop3.2.

Next, we need to set the environment variables. Go to Control Panel -> System -> About -> Advanced System Settings, open the Advanced tab and click Environment Variables. Add three User variables: SPARK_HOME, HADOOP_HOME and JAVA_HOME

with the following values:

SPARK_HOME C:\SparkApp\spark-3.2.0-bin-hadoop3.2
HADOOP_HOME C:\SparkApp\spark-3.2.0-bin-hadoop3.2
JAVA_HOME C:\Program Files\Java\jdk-17.0.1

Each variable points to the installation root rather than to its bin subfolder: Spark and Hadoop look for their executables (including winutils.exe) in the bin folder inside these locations.

The last part is to download the winutils.exe file and paste it into the bin folder of your Spark binary: C:\SparkApp\spark-3.2.0-bin-hadoop3.2\bin.

winutils.exe can be found on GitHub; I am downloading the build for Hadoop-3.0.0.

After copying the file, open the command line on your Windows machine. Navigate to C:\SparkApp\spark-3.2.0-bin-hadoop3.2\bin and run the command spark-shell. This CLI utility comes with this distribution of Apache Spark. You are ready to start using Spark.

macOS

When installing Apache Spark on macOS, most of the installation can be done from the CLI.

Presumably, you already have Homebrew installed. You can always update Homebrew itself and then upgrade the installed packages to their latest versions:

brew update && brew upgrade

After this is finished, run the Java installation:

brew install java8
brew install java
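To confirm which Java version ended up on the PATH, the major version can be pulled out of the java -version banner. The snippet below is a minimal sketch and not part of the original setup: java_major is a hypothetical helper name, and it assumes the banner contains the usual version "x.y.z" string.

```shell
# Extract the major Java version from the text printed by `java -version`.
# Handles the legacy scheme ("1.8.0_301" -> 8) and the modern one ("11.0.12" -> 11).
java_major() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    "")  return 1 ;;                                   # no version string found
    1.*) ver=${ver#1.}; printf '%s\n' "${ver%%.*}" ;;  # 1.8.0_301 -> 8
    *)   printf '%s\n' "${ver%%[.-]*}" ;;              # 11.0.12 -> 11
  esac
}

# Only query Java when it is actually installed; java -version writes to stderr.
if command -v java >/dev/null 2>&1; then
  java_major "$(java -version 2>&1)" || echo "could not determine the Java version"
fi
```

If the printed number is lower than 8, Spark will most likely refuse to start.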

Installing the Xcode command line tools is the next step:

xcode-select --install

After this is finished, install Scala:

brew install scala

And the final step is to install Spark by typing the following command in CLI:

brew install apache-spark

And run:

brew link --overwrite apache-spark
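Before starting the shell, it can help to check that everything installed so far is reachable from the terminal. This is only a sketch under my own assumptions: check_tool is a hypothetical helper name, and the list of tools simply mirrors the packages installed in the steps above.

```shell
# Report whether a command can be found through PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: found at %s\n' "$1" "$(command -v "$1")"
  else
    printf '%s: not found on PATH\n' "$1"
    return 1
  fi
}

# Check each tool; "|| true" keeps the loop going when one of them is missing.
for tool in java scala spark-shell; do
  check_tool "$tool" || true
done
```

Any tool reported as not found points back to the corresponding brew install step.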

Finally, the command to start the Spark shell is the same on macOS as it is on Windows. Run the following command:

spark-shell

Spark is up and running on OpenJDK VM with Java 11.
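For a quick non-interactive check, a one-line Scala statement can be piped into the Spark shell instead of typing it at the prompt. The sketch below assumes the shell reads statements from standard input and exits at end of input; spark_smoke_test is a hypothetical helper name, and the optional argument is there only so a full path to the launcher can be passed when it is not on the PATH.

```shell
# Feed a one-line Scala job to the Spark shell on standard input.
# The launcher defaults to spark-shell; pass a full path if it is not on PATH.
spark_smoke_test() {
  launcher=${1:-spark-shell}
  if ! command -v "$launcher" >/dev/null 2>&1; then
    printf '%s not found; check the installation steps\n' "$launcher" >&2
    return 1
  fi
  # spark is the SparkSession that the Spark shell creates on startup.
  printf '%s\n' 'println("rows: " + spark.range(5).count())' | "$launcher"
}

spark_smoke_test || true
```

If the installation works, the output should include a line reading rows: 5 among the usual startup messages.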

The complete set of code, documents, notebooks, and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers

Tomorrow we will look into the Spark CLI and Web UI and get to know the environment.

Happy Spark Advent of 2021!
