
Advent of 2020, Day 12 – Using Azure Databricks Notebooks with Python Language for data analytics

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers].

Series of Azure Databricks posts:

In the previous posts we looked at the SQL and R languages and how to get some basic data preparation done. Today we will look at Python and how to go about data analytics.

Using data frames and getting data from SQL and DBFS

Create a new notebook with the name Day12_Py_Analytics and language Python, and attach it to the cluster we created on Day 4. Then let's add some data from the FileStore, using the file we uploaded on Day 6.

csv_df = spark.read.csv("/FileStore/Day6Data_dbfs.csv", header=True)
display(csv_df)

We can also import data from a SQL table into a data frame by simply writing an SQL statement.

# In a Databricks Python notebook `spark`, the active SparkSession, is predefined
display(spark.sql("select * from day10.temperature"))

Besides displaying the dataset, you can also store the result of a query in a variable and use it later.

# for display
display(spark.sql("select * from day10.temperature"))

# to save to a variable
df = spark.sql("select * from day10.temperature")

Now let's get some data from the Databricks sample datasets (available to everyone). Since the sample data lives on DBFS, we can read it directly with Python pandas.

import pandas as pd

dfcovid = pd.read_csv("/dbfs/databricks-datasets/COVID/covid-19-data/us-states.csv")
dfcovid.head()
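Note that pd.read_csv leaves the date column as plain strings. For true time-series work one could parse it with pd.to_datetime; a minimal sketch on synthetic rows, assuming the same column layout (date, state, cases, deaths) as us-states.csv, with made-up numbers:

```python
import pandas as pd

# Synthetic rows mimicking the us-states.csv layout (numbers are made up)
df = pd.DataFrame({
    "date": ["2020-12-01", "2020-12-01"],
    "state": ["New York", "Texas"],
    "cases": [650000, 1250000],
    "deaths": [34000, 22000],
})
df["date"] = pd.to_datetime(df["date"])  # parse strings into real timestamps
print(str(df["date"].dtype))  # datetime64[ns]
```

String dates still compare correctly in ISO format, so the filtering below works either way.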

Now let's make a scatter plot of cases against deaths per state, using the following Python code, which runs as-is in Azure Databricks.

# Filter to the first of December 2020
df_12_01 = dfcovid[dfcovid["date"] == "2020-12-01"]

ax = df_12_01.plot(x="cases", y="deaths", kind="scatter", 
                   figsize=(12,8), s=100, title="Deaths vs Cases on 2020-12-01 - All States")

df_12_01[["cases", "deaths", "state"]].apply(lambda row: ax.text(*row), axis=1);

Now let's compare a couple of the hardest-hit states (New York, Texas, California and Florida) and create a subset with only these four:


df_ny_cal_tex_flor = dfcovid[dfcovid["state"].isin(["New York", "California", "Florida", "Texas"])]

Now pivot the data so that date becomes the index and each state a column, giving us deaths over time:

df_ny_cal_tex_flor = df_ny_cal_tex_flor.pivot(index='date', columns='state', values='deaths').fillna(0)
df_ny_cal_tex_flor
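The pivot step above reshapes the data from long format (one row per date and state) to wide format (one column per state). A minimal sketch on synthetic data (hypothetical values) shows what it does:

```python
import pandas as pd

# Long format: one row per (date, state) pair; numbers are made up
long_df = pd.DataFrame({
    "date": ["2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02"],
    "state": ["New York", "Texas", "New York", "Texas"],
    "deaths": [34000, 22000, 34100, 22150],
})

# Wide format: date as index, one column per state
wide = long_df.pivot(index="date", columns="state", values="deaths").fillna(0)
print(wide.columns.tolist())  # ['New York', 'Texas']
print(wide.loc["2020-12-02", "Texas"])  # 22150
```

The fillna(0) covers dates where a state has no row yet, which pivot would otherwise leave as NaN.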

and plot the reshaped data:

df_ny_cal_tex_flor.plot.line(title="Deaths 2020-01-25 to 2020-12-10 - CA, NY, TX, FL", figsize=(12,8))

Now, for a simple regression analysis, we will split the data into train and test sets. Because of the first and second waves we need to think about where to split; let's train on data up to mid-November and test on everything after.

train_df = dfcovid[(dfcovid["date"] >= "2020-07-01") & (dfcovid["date"] <= "2020-11-15")]
test_df = dfcovid[dfcovid["date"] > "2020-11-15"]

X_train = train_df[["cases"]]
y_train = train_df["deaths"]

X_test = test_df[["cases"]]
y_test = test_df["deaths"]
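The split above compares dates as plain strings, which works because ISO-8601 dates ("YYYY-MM-DD") sort lexicographically in the same order as the dates themselves. A quick sketch with hypothetical dates:

```python
import pandas as pd

# ISO date strings sort the same way the dates do, so string
# comparisons are enough for this train/test split (synthetic data)
df = pd.DataFrame({"date": ["2020-06-30", "2020-07-01", "2020-11-15", "2020-11-16"]})
train = df[(df["date"] >= "2020-07-01") & (df["date"] <= "2020-11-15")]
test = df[df["date"] > "2020-11-15"]
print(len(train), len(test))  # 2 1
```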

We will use scikit-learn to do simple linear regression.

from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(X_train, y_train)
print(f"num_deaths = {lr.intercept_:.4f} + {lr.coef_[0]:.4f}*cases")

If there are no cases, there should be no deaths caused by COVID-19; this gives us a natural baseline, so let's fix the intercept at 0.

lr = LinearRegression(fit_intercept=False).fit(X_train, y_train)
print(f"num_deaths = {lr.coef_[0]:.4f}*cases")
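With fit_intercept=False, ordinary least squares has a closed form: the single coefficient is sum(x*y) / sum(x*x). A quick sketch with made-up numbers (not the actual COVID figures) shows what scikit-learn computes under the hood:

```python
import numpy as np

# Closed-form slope for no-intercept least squares: coef = Σ(x·y) / Σ(x·x)
x = np.array([100.0, 200.0, 400.0])   # hypothetical case counts
y = np.array([3.0, 5.0, 11.0])        # hypothetical death counts
coef = (x * y).sum() / (x * x).sum()
print(round(coef, 6))  # 0.027143
```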

This model implies a 2.68% mortality rate across our dataset. We know that some states have higher mortality rates and that a linear model is far from ideal here, but it serves to showcase Python in Databricks.
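To see that state-to-state variation directly, one could compute a per-state case-fatality ratio instead of a single pooled slope. A sketch with hypothetical numbers (not the real us-states.csv figures):

```python
import pandas as pd

# Hypothetical snapshot of two states on one date (made-up numbers)
latest = pd.DataFrame({
    "state": ["New York", "Texas"],
    "cases": [650000, 1250000],
    "deaths": [34000, 22000],
})
latest["cfr"] = latest["deaths"] / latest["cases"]  # deaths per case
print(latest.set_index("state")["cfr"].round(4).to_dict())
# {'New York': 0.0523, 'Texas': 0.0176}
```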

Tomorrow we will explore how to use Koalas in Python to do data engineering, so stay tuned.

The complete set of code and notebooks will be available in the GitHub repository.

Happy Coding and Stay Healthy!
