
Advent of 2020, Day 6 – Importing and storing data to Azure Databricks


Series of Azure Databricks posts:

Yesterday we started exploring the Azure services that are created when using Azure Databricks. One of the services I would like to explore today is storage, and in particular how to import and how to store data.

Log in to Azure Databricks and, on the main (home) page, select “Create Table” under the recommended common tasks. Don’t start your cluster yet (if it is running, please terminate it for now).

This will offer you a variety of options for importing data to DBFS or connecting Azure Databricks with other services.

Drag the data file named Day6data.csv (available on Github in the data folder) onto the upload square. For easier understanding, let’s check the CSV file schema (a simple one with three columns: 1. Date (datetime format), 2. Temperature (integer format), 3. City (string format)).
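For reference, here is that same schema sketched as a PySpark StructType. This is only a minimal sketch: the column names follow the description above, and Python is just one of the languages you could use in a Databricks notebook.

```python
from pyspark.sql.types import StructType, StructField, DateType, IntegerType, StringType

# Expected layout of Day6data.csv: Date, Temperature, City
day6_schema = StructType([
    StructField("Date", DateType(), True),            # datetime format
    StructField("Temperature", IntegerType(), True),  # integer format
    StructField("City", StringType(), True),          # string format
])
```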

But before you start uploading the data, let’s check the Azure resource group. I have not yet started any Databricks cluster in my workspace, and here you can see that the VNet, Storage and Network Security Group are always available for the Azure Databricks service. Only when you start the cluster will additional resources (IP addresses, disks, VMs, …) appear.

This gives us a better idea of where and how data is persisted. Your data will always be available and stored on blob storage. Meaning, even if you decide not only to terminate the cluster but to delete the cluster as well, your data will always be safely stored. When you add a new cluster to the same workspace, it will automatically retrieve the data from blob storage.

1. Import

Drag and drop the CSV file into the “Drop zone” as discussed previously. It should look like this:

You now have two options:

Select “Create Table with UI”. Only now will you be asked to select the cluster:

Now select “Create Table in Notebook” and Databricks will create a first notebook for you that uses Spark to read the uploaded data from DBFS and create a table.
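The generated notebook typically looks roughly like the Python sketch below. Treat it only as a sketch: the exact file location under /FileStore/tables and the table name are assumptions that Databricks fills in for you.

```python
# Rough shape of the auto-generated "Create Table in Notebook" code.
# The file location is an assumption; Databricks inserts the real one.
file_location = "/FileStore/tables/Day6data.csv"

df = (spark.read.format("csv")
      .option("header", "true")        # first row contains the column names
      .option("inferSchema", "true")   # let Spark guess the data types
      .option("sep", ",")
      .load(file_location))

display(df)

# Optionally persist it as a table in the workspace (table name is an assumption):
# df.write.format("parquet").saveAsTable("day6data_csv")
```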

If I want to run this notebook, I need to have my cluster up and running, so let’s start a cluster. On the left vertical navigation bar, select the Clusters icon. You will get the list of all the clusters you are using; select the one we created on Day 4.

If you want, check the resource group for your Azure Databricks service to see all the running VMs, disks and VNets.

Now import the data again by dragging and dropping the CSV file into the “Drop Zone” (repeat the process) and hit “Create Table with UI”. Now you should have a cluster available. Select it and preview the table.

You can see that the table name is propagated from the file name, and that the file type and column delimiter are detected automatically. Only “First row is header” needs to be selected in order to have the columns properly named and the data types set correctly.

Now we can create the table. When Databricks finishes, a report will be presented with a recap of the table location (yes, location!), the schema and an overview of sample data.
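Once the table exists, you can query it from any notebook attached to the cluster. A minimal sketch, assuming the UI derived the table name day6data_csv from the file name (check the recap report for the actual name):

```python
# Table name is an assumption; the Create Table UI derives it from the file name.
display(spark.sql("SELECT City, AVG(Temperature) AS avg_temp FROM day6data_csv GROUP BY City"))
```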

This table is now available on my cluster. What does this mean? The table is persisted not only on the cluster but in your Azure Databricks workspace. This is important for understanding how and where data is stored. Go to the Data icon on the left vertical navigation bar.

This database is attached to my cluster. If I terminate my cluster, will I lose my data? Try stopping the cluster and checking the data again. And bam… the database is not available, since there is no cluster “attached” to it.

But hold your horses. The data is still available on blob storage, just not visible through DBFS. The database will be visible again when you start your cluster.

2. Storing data to DBFS

DBFS – the Databricks File System – is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters through the UI or notebooks. In this way, DBFS is a decoupled data layer (an abstraction layer) on top of scalable Azure object storage.

Storage is mounted as the DBFS root, and several default folders are created under it (for example /FileStore, where uploaded files land).

You will also find many other folders that get generated through notebooks.
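A minimal sketch of how to list these folders from a notebook, using the dbutils object that is predefined in Databricks notebooks:

```python
# List the folders at the DBFS root (dbutils is predefined in Databricks notebooks)
for f in dbutils.fs.ls("/"):
    print(f.path)
```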

Before we begin, let’s make your life easier. Go to the Admin Console settings, select the Advanced tab and find “DBFS File Browser”. By default, this option is disabled, so let’s enable it.

This will enable you to view the data through the DBFS structure and give you an upload option and a search option.

Uploading files is now easier and they can be seen immediately in FileStore. There is the same file, named Day6Data_dbfs.csv, in the Github data folder; you can upload it manually and it will be visible in FileStore:
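As a quick check, the sketch below lists FileStore and reads the uploaded file with Spark. The exact path under /FileStore is an assumption here; adjust it to wherever the DBFS file browser placed your upload.

```python
# The path below assumes the file landed directly in /FileStore;
# adjust it to the folder the DBFS file browser actually used.
display(dbutils.fs.ls("/FileStore/"))

df_dbfs = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/FileStore/Day6Data_dbfs.csv"))
display(df_dbfs)
```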

Tomorrow we will explore how we can use a notebook to access this file with different commands (CLI, Bash, Utils, Python, R, Spark). And since we will be using notebooks for the first time, we will do a little exploration of notebooks as well.

The complete set of code and notebooks will be available in the Github repository.

Happy Coding and Stay Healthy!
