Series of Azure Databricks posts:
- Dec 01: What is Azure Databricks
- Dec 02: How to get started with Azure Databricks
- Dec 03: Getting to know the workspace and Azure Databricks platform
- Dec 04: Creating your first Azure Databricks cluster
- Dec 05: Understanding Azure Databricks cluster architecture, workers, drivers and jobs
- Dec 06: Importing and storing data to Azure Databricks
- Dec 07: Starting with Databricks notebooks and loading data to DBFS
Yesterday we started working with notebooks and looked at how to read data from them.
Today we will look at the Databricks CLI and how you can use it to upload (copy) files from your local machine or remote server to DBFS.
The Databricks CLI is a command-line interface to the Databricks platform. It belongs to the group of developer tools, is easy to set up and is straightforward to use. Many tasks can be automated with it.
1. Installing the CLI
Using Python 3.6 (or above), run the following pip command in CMD:
pip3 install databricks-cli
But before using the CLI, a personal access token needs to be created for authentication.
2. Authentication with Personal Access Token
On your Azure Databricks workspace home screen, go to Settings:
Then select User Settings to get the list of access tokens.
Click Generate New Token and, in the dialog window, give the token a name and a lifetime.
After the token is generated, make sure to copy it, because you will not be able to see it again later. A token can be revoked when needed; otherwise it is valid until its expiry date (in my case 90 days). So remember to generate a new one when the lifetime runs out!
3. Working with CLI
Go back to CMD and run the following:
databricks --version
will give you the current version you are rocking. After that, let’s configure the connectivity.
databricks configure --token
and you will be prompted for two pieces of information:
- the host (in my case: https://adb-8606925487212195.15.azuredatabricks.net/)
- the token
The host is available in your browser: go to the browser tab with your Azure Databricks workspace and copy the URL:
The token is the one that was generated for you in step two. It should look like: dapib166345f2938xxxxxxxxxxxxxxc.
Once you have entered both values, the connection is set!
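If you want to double-check that the configuration was stored, here is a minimal sketch. It assumes the CLI installed above saved the settings in its usual profile file, ~/.databrickscfg, with a host and a token entry; the function name has_databricks_profile is made up for this example and is not part of the CLI.

```shell
# Sketch: check that a Databricks CLI profile file holds a host and a token.
# Assumption: the settings were written to ~/.databrickscfg
# (pass another path as the first argument if yours lives elsewhere).
# has_databricks_profile is a hypothetical helper, not a CLI command.
has_databricks_profile() {
  local cfg="${1:-$HOME/.databrickscfg}"
  # No file at all means `databricks configure --token` has not been run yet
  [ -f "$cfg" ] || return 1
  # Both keys must be present for the connection to work
  grep -Eq '^[[:space:]]*host[[:space:]]*=' "$cfg" &&
    grep -Eq '^[[:space:]]*token[[:space:]]*=' "$cfg"
}
```

Calling has_databricks_profile && echo "profile found" only tells you that both entries exist in the file. It does not verify that the token is still valid.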
Now you can work with DBFS from your local machine or server by running CLI commands in bash. For example:
databricks fs ls
will list all the files in the root folder of DBFS in your Azure Databricks workspace.
4. Uploading file using DBFS CLI
Databricks has already aliased the databricks fs command to simply dbfs. The following two commands are equivalent:
databricks fs ls
dbfs ls
So using the DBFS CLI means, in other words, using the Databricks File System CLI. With this, we can start copying a file. Copying from my local machine to Azure Databricks should look like:
dbfs cp /mymachine/test_dbfs.txt dbfs:/FileStore/file_dbfs.txt
My complete bash code (as seen on the screenshot) is:
pwd
touch test_dbfs.txt
dbfs cp test_dbfs.txt dbfs:/FileStore/file_dbfs.txt
After refreshing the data in my Databricks workspace, you can see that the file is there. The commands pwd and touch are here merely for demonstration.
This approach can be heavily automated for daily data loads to Azure Databricks, delta uploads, data migration or any other data engineering and data movement task. Note also that the Databricks CLI is a powerful tool with much broader usage than file copying.
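As an illustration of such automation, here is a minimal sketch of a bash function that copies every file from a local folder to a DBFS folder with dbfs cp. The function name upload_dir, the DBFS_CMD variable and the folder names in the usage note are assumptions made up for this example. The --overwrite option of dbfs cp replaces files that already exist at the target.

```shell
#!/usr/bin/env bash
# Sketch: upload every regular file from a local folder to a DBFS folder.
# upload_dir and DBFS_CMD are hypothetical names for this example.
# Setting DBFS_CMD=echo gives a dry run that only prints the commands.
upload_dir() {
  local src_dir="$1"              # local folder with the files to upload
  local dst_dir="$2"              # target folder, e.g. dbfs:/FileStore/daily
  local cmd="${DBFS_CMD:-dbfs}"   # the real CLI unless overridden
  local f
  for f in "$src_dir"/*; do
    # Skip subfolders and the unmatched glob of an empty folder
    [ -f "$f" ] || continue
    # Stop at the first failed copy so a scheduler can report the error
    "$cmd" cp --overwrite "$f" "$dst_dir/$(basename "$f")" || return 1
  done
}
```

A call such as upload_dir ./exports dbfs:/FileStore/daily could then be scheduled (for example with cron) for a daily load. Running it first with DBFS_CMD=echo shows which copy commands would be issued without touching DBFS.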
Tomorrow we will check how to connect Azure Blob storage with Azure Databricks and how to read data from Blob Storage in Notebooks.
The complete set of code and notebooks will be available at the GitHub repository.
Happy Coding and Stay Healthy!