
Advent of 2020, Day 8 – Using Databricks CLI and DBFS CLI for file upload

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers].


Yesterday we looked at how to use notebooks and how to read data with them.

Today we will look at the Databricks CLI and at how you can use it to upload (copy) files from your local machine or remote server to DBFS.

The Databricks CLI is a command-line interface to the Databricks platform. It belongs to the group of Databricks developer tools and should be easy to set up and straightforward to use. You can automate many tasks with it.

1. Installing the CLI

Using Python 3.6 (or above), run the following pip command in CMD:

pip3 install databricks-cli
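Before moving on, it can help to confirm that the executable actually landed on your PATH. The snippet below is a minimal sketch for a bash-like shell; check_cli is a hypothetical helper name and is not part of the Databricks CLI.

```shell
# check_cli is a hypothetical helper (not part of databricks-cli).
# It reports whether the given executable can be found on the PATH.
check_cli() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found (is pip's script directory on your PATH?)"
    return 1
  fi
}
```

Calling check_cli databricks after the installation should report that the command was found; if it does not, the directory where pip places scripts is probably missing from the PATH.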

But before using the CLI, a personal access token needs to be created for authentication.

2. Authentication with Personal Access Token

On the home screen of your Azure Databricks workspace, go to the settings:

And select User settings to get the list of Access Tokens.

Click Generate New Token and, in the dialog window, give the token a name and a lifetime.

After the token is generated, make sure to copy it, because you will not be able to see it again later. A token can be revoked when needed; otherwise it expires at the end of its lifetime (90 days in my case). So remember to generate a new one when that period is over!

3. Working with CLI

Go back to CMD and run the following:

databricks --version

This will give you the version of the CLI you are running. After that, let’s configure the connectivity.

databricks configure --token

You will be prompted for two pieces of information: the host and the token.

The host is available in your browser. Go to the browser tab with Azure Databricks and copy the workspace URL (the part ending in azuredatabricks.net):

The token is the one that was generated for you in step two. It should look like: dapib166345f2938xxxxxxxxxxxxxxc.

Once you have entered both pieces of information, the connection is set!
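For unattended runs, where nobody is around to answer the prompts, the databricks-cli package can also pick up the host and token from the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables. The following is a sketch under assumptions: both values are placeholders rather than real credentials, and require_databricks_env is a hypothetical guard function that does not come with the CLI.

```shell
# Placeholder values, not real credentials: replace them with your own
# workspace URL and the token generated in step two.
export DATABRICKS_HOST="https://adb-0000000000000000.0.azuredatabricks.net"
export DATABRICKS_TOKEN="dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

# require_databricks_env is a hypothetical guard for scripts: it fails early
# when either variable is missing or the host is not an https URL.
require_databricks_env() {
  if [ -z "${DATABRICKS_HOST:-}" ] || [ -z "${DATABRICKS_TOKEN:-}" ]; then
    echo "DATABRICKS_HOST and DATABRICKS_TOKEN must both be set" >&2
    return 1
  fi
  case "$DATABRICKS_HOST" in
    https://*) return 0 ;;
    *) echo "DATABRICKS_HOST should start with https://" >&2; return 1 ;;
  esac
}
```

In a scheduled job the token would normally come from a secret store rather than being written into the script itself.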

You can now work with DBFS from your local machine or server by running CLI commands in bash. For example:

databricks fs ls

will list all the files in the root folder of DBFS in your Azure Databricks workspace.

4. Uploading file using DBFS CLI

Databricks has already aliased the databricks fs command to simply dbfs. The following two commands are therefore equivalent:

databricks fs ls
dbfs ls

So using the DBFS CLI simply means using the file system (fs) commands of the Databricks CLI against DBFS, the Databricks File System. With this, we can start copying a file. Copying from my local machine to Azure Databricks should look like:

dbfs cp /mymachine/test_dbfs.txt dbfs:/FileStore/file_dbfs.txt

My complete bash code (as seen in the screenshot) is:

pwd
touch test_dbfs.txt
dbfs cp test_dbfs.txt dbfs:/FileStore/file_dbfs.txt

After refreshing the data in my Databricks workspace, you can see that the file is there. The pwd and touch commands are here merely for demonstration.
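The same check can be done from the command line instead of the workspace, by listing the target folder and looking for the file name. This is a minimal sketch: dbfs_has_file is a hypothetical helper, and it assumes that dbfs ls prints one entry name per line. The DBFS_CMD variable is also an assumption of this sketch, added only so that the listing command can be swapped out.

```shell
# dbfs_has_file is a hypothetical helper (not part of databricks-cli).
# It succeeds when the DBFS folder listing contains an entry with exactly
# the given name. Assumes `dbfs ls` prints one entry name per line.
# DBFS_CMD defaults to `dbfs` and can be overridden, e.g. for testing.
dbfs_has_file() {
  folder=$1
  name=$2
  # -x matches the whole line, -F treats the name as a literal string
  ${DBFS_CMD:-dbfs} ls "$folder" | grep -qxF "$name"
}
```

With the upload above, dbfs_has_file dbfs:/FileStore file_dbfs.txt should succeed once the copy has finished.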

This approach can be heavily automated for daily data loads to Azure Databricks, delta uploads, data migration or any other data engineering and data movement task. Note also that the Databricks CLI is a powerful tool with much broader usage than file copying.
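As an illustration of such automation, the sketch below uploads every file from a local folder with one dbfs cp call per file. The function name upload_dir, the DBFS_CMD variable and the target folder in the comment are all hypothetical and chosen only for this example; the --overwrite flag of dbfs cp replaces a file that already exists at the destination.

```shell
# upload_dir is a hypothetical helper: it copies every regular file in a
# local directory to a DBFS folder, one `dbfs cp` call per file.
# DBFS_CMD defaults to `dbfs`; set DBFS_CMD=echo for a dry run that only
# prints the arguments each call would receive.
upload_dir() {
  src_dir=$1
  dst_dir=$2   # e.g. dbfs:/FileStore/daily (an illustrative target folder)
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue   # skip subdirectories and an unmatched glob
    # --overwrite replaces a file that already exists at the destination
    ${DBFS_CMD:-dbfs} cp --overwrite "$f" "$dst_dir/$(basename "$f")" || return 1
  done
}
```

A script like this could be run from a scheduler such as cron. For whole directory trees, dbfs cp also accepts a -r (recursive) flag, which may be simpler than looping file by file.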

Tomorrow we will check how to connect Azure Blob Storage with Azure Databricks and how to read data from Blob Storage in notebooks.

The complete set of code and notebooks will be available at the GitHub repository.

Happy Coding and Stay Healthy!
