Fast data exploration for predictive modeling

Pablo Casas

3 years ago

[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers].


The problem: before modeling, we need to check (and often change) numerical and categorical variables, NAs, variables with a single unique value, and high-cardinality variables.

The new version of funModeling, 1.9.2, was released to assist with this step, which comes before creating machine learning models.

Introduction

The data_integrity function provides information about the format of all the variables, as well as some short stats about NA values.

This way we can select and transform the variables, keeping them in the format we need.

# install.packages("funModeling")
library(funModeling)

Load the messy data:

library(tidyverse)
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')

Now we call the data_integrity function, which returns an integrity object:

di=data_integrity(data)

Then, the summary function gives us a quick, self-explanatory overview:

summary(di)

## 
## ◌ {Numerical with NA} num_vessels_flour, thal
## ◌ {Categorical with NA} gender
## ● {One unique value} constant

Now we can apply mutate_at, select, or any other function to those specific columns.

In case we need the variable names as a vector of strings, we can use the RStudio bare-combine add-in:
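As a minimal sketch of that idea, the snippet below builds a small toy data frame (hypothetical values; only the column names are borrowed from the messy data set), runs data_integrity on it, and then uses the resulting name vectors with select and mutate_at. Median imputation is just one possible treatment, chosen here for illustration.

```r
library(funModeling)
library(dplyr)

# Toy data frame: hypothetical values, column names borrowed from the messy data
toy <- data.frame(
  age               = c(63, 67, 41, 56),
  num_vessels_flour = c(0, NA, 2, 1),
  gender            = c("male", NA, "female", "male"),
  constant          = c(1, 1, 1, 1),
  stringsAsFactors  = FALSE
)

di_toy <- data_integrity(toy)

# Drop the columns that hold a single unique value
toy_clean <- select(toy, -all_of(di_toy$results$vars_one_value))

# Replace NA with the column median in every numerical column that has NAs
toy_clean <- mutate_at(
  toy_clean,
  di_toy$results$vars_num_with_NA$variable,
  ~ ifelse(is.na(.), median(., na.rm = TRUE), .)
)

toy_clean
```

Because the column names come from the integrity object rather than being typed by hand, the same two steps should carry over to the full data set by swapping toy for data and di_toy for di.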

My keyboard shortcut for this lil' function gets quite the workout…
"hrbraddins::bare_combine()" by @hrbrmstr https://t.co/8dwqNEso0B #rstats pic.twitter.com/gyqz2mUE0Y
— Mara Averick (@dataandme) July 29, 2019

The threshold above which a categorical variable is flagged as high cardinality can be changed with the MAX_UNIQUE parameter.
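A short sketch of how the parameter might be used, on a toy data frame with made-up values. The threshold of 3 is arbitrary, and the default of 35 is taken from the MAX_UNIQUE value that the integrity object prints.

```r
library(funModeling)

# Toy data (hypothetical): `city` has 5 distinct values, `gender` has 2
toy_card <- data.frame(
  city             = c("A", "B", "C", "D", "E", "A"),
  gender           = c("f", "m", "f", "m", "f", "m"),
  stringsAsFactors = FALSE
)

# Default threshold (35): neither column exceeds it
di_default <- data_integrity(toy_card)

# Stricter threshold: categorical variables with more than 3 unique values are flagged
di_strict <- data_integrity(toy_card, MAX_UNIQUE = 3)

di_default$results$vars_cat_high_card
di_strict$results$vars_cat_high_card
```

With the stricter threshold, city should appear in vars_cat_high_card while gender should not.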

Accessing all the information

If we print the integrity object, we can see a lot of information regarding NA, numerical, categorical and other types, alongside the high cardinality variables:

di

## $vars_num_with_NA
##            variable q_na       p_na
## 1 num_vessels_flour    4 0.01320132
## 2              thal    2 0.00660066
## 
## $vars_cat_with_NA
##   variable q_na       p_na
## 1   gender    1 0.00330033
## 
## $vars_cat_high_card
## [1] variable unique  
## <0 rows> (or 0-length row.names)
## 
## $MAX_UNIQUE
## [1] 35
## 
## $vars_one_value
## [1] "constant"
## 
## $vars_cat
## [1] "gender"            "has_heart_disease"
## 
## $vars_num
##  [1] "age"                    "chest_pain"             "resting_blood_pressure"
##  [4] "serum_cholestoral"      "fasting_blood_sugar"    "resting_electro"       
##  [7] "max_heart_rate"         "exer_angina"            "oldpeak"               
## [10] "slope"                  "num_vessels_flour"      "thal"                  
## [13] "heart_disease_severity" "exter_angina"           "constant"              
## [16] "id"                    
## 
## $vars_char
## [1] "gender"            "has_heart_disease"
## 
## $vars_factor
## character(0)
## 
## $vars_other
## [1] "has_heart_disease2" "fecha"              "fecha2"

Each element can be accessed directly, so we can operate on it quickly. For example, the numerical variables:

di$results$vars_num

##  [1] "age"                    "chest_pain"             "resting_blood_pressure"
##  [4] "serum_cholestoral"      "fasting_blood_sugar"    "resting_electro"       
##  [7] "max_heart_rate"         "exer_angina"            "oldpeak"               
## [10] "slope"                  "num_vessels_flour"      "thal"                  
## [13] "heart_disease_severity" "exter_angina"           "constant"              
## [16] "id"

Numerical variables with NA values:

di$results$vars_num_with_NA$variable

## [1] "num_vessels_flour" "thal"

Help page:

help("data_integrity")

New `status` function

This is the internal function used in data_integrity:

status(heart_disease)

##                  variable q_zeros   p_zeros q_na       p_na q_inf p_inf    type unique
## 1                     age       0 0.0000000    0 0.00000000     0     0 integer     41
## 2                  gender       0 0.0000000    0 0.00000000     0     0  factor      2
## 3              chest_pain       0 0.0000000    0 0.00000000     0     0  factor      4
## 4  resting_blood_pressure       0 0.0000000    0 0.00000000     0     0 integer     50
## 5       serum_cholestoral       0 0.0000000    0 0.00000000     0     0 integer    152
## 6     fasting_blood_sugar     258 0.8514851    0 0.00000000     0     0  factor      2
## 7         resting_electro     151 0.4983498    0 0.00000000     0     0  factor      3
## 8          max_heart_rate       0 0.0000000    0 0.00000000     0     0 integer     91
## 9             exer_angina     204 0.6732673    0 0.00000000     0     0 integer      2
## 10                oldpeak      99 0.3267327    0 0.00000000     0     0 numeric     40
## 11                  slope       0 0.0000000    0 0.00000000     0     0 integer      3
## 12      num_vessels_flour     176 0.5808581    4 0.01320132     0     0 integer      4
## 13                   thal       0 0.0000000    2 0.00660066     0     0  factor      3
## 14 heart_disease_severity     164 0.5412541    0 0.00000000     0     0 integer      5
## 15           exter_angina     204 0.6732673    0 0.00000000     0     0  factor      2
## 16      has_heart_disease       0 0.0000000    0 0.00000000     0     0  factor      2

It’s another version of df_status, where percentages are expressed in the range 0 to 1 (not 0 to 100), which makes them more intuitive to use in filters.

This is the same object as di$status_now.
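To illustrate why the 0 to 1 scale is convenient in filters, here is a small sketch using the heart_disease data that ships with funModeling. The 0.5 cut-off is an arbitrary choice for the example.

```r
library(funModeling)

st <- status(heart_disease)

# Variables where more than half of the values are zero
mostly_zero <- st$variable[st$p_zeros > 0.5]
mostly_zero

# Variables with any missing values
with_na <- st$variable[st$p_na > 0]
with_na
```

Since the result is a plain data frame, the same conditions could be written with dplyr::filter instead of base subsetting.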

Next release?

It will contain, based on data_integrity, an automated data quality test suited to the predictive model we need to run.
I found this task quite important and repetitive when I teach. Hopefully it will save some time!
