Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
This post will introduction to the healthyR.ai
package. The healthyR.ai
package is a collection of functions that I have developed to help me analyze and visualize data. The package is designed to be easy to use and to provide a wide range of functionality for data analysis. The package is also meant to help and provide some easy boilerplate funcationality for machine learning.
It might be best to view this post in light mode to see the tables better.
< section id="installation" class="level1">Installation
You can install the released version of healthyR.ai from CRAN with:
install.packages("healthyR.ai")
And the development version from GitHub with:
# install.packages("devtools") devtools::install_github("spsanderson/healthyR.ai")< section id="getting-started" class="level1">
Getting Started
< section id="the-goal" class="level2">The Goal
The ultimate goal really is to make it easier to do data analysis and machine learning in R. The package is designed to be easy to use and to provide a wide range of functionality for data analysis. The package is also meant to help and provide some easy boilerplate functionality for machine learning. This package is in its early stages and will be updated frequently.
It also keeps with the same framework of all of the healthyverse
packages in that it is meant for the user to be able to use the package without having to know a lot of R. Many rural hospitals do not have the resources to perform this sort of work, so I am working hard to build these types of things out for them for free.
Let’s go through some examples.
library(healthyR.ai) library(tidyverse) library(DT)
Now let’s get a list of all the functions that are exposed in the package.
# Functions and their arguments for healthyR pat <- c("%>%",":=","as_label","as_name","enquo","enquos","expr", "sym","syms","required_pkgs.step_hai_fourier", "required_pkgs.step_hai_fourier_discrete", "required_pkgs.step_hai_hyperbolic", "required_pkgs.step_hai_scale_zero_one", "required_pkgs.step_hai_scal_zscore", "required_pkgs.step_hai_winsorized_move", "required_pkgs.step_hai_winsorized_truncate") tibble(fns = ls.str("package:healthyR.ai")) |> filter(!fns %in% pat) |> mutate(params = purrr::map(fns, formalArgs)) |> group_by(fns) |> mutate(func_with_params = toString(params)) |> mutate( func_with_params = ifelse( str_detect( func_with_params, "\\("), paste0(fns, func_with_params), paste0(fns, "(", func_with_params, ")") )) |> select(fns, func_with_params) |> mutate(fns = as.factor(fns)) |> datatable( #class = 'cell-boarder-stripe', colnames = c("Function", "Full Call"), options = list( autowidth = TRUE, pageLength = 10 ) )
Examples
Let’s start off going through an example of using the function, pca_your_repipe
. First, the syntax.
Example – PCA a recipe
< section id="syntax" class="level3">Syntax
pca_your_recipe(.recipe_object, .data, .threshold = 0.75, .top_n = 5)< section id="arguments" class="level3">
Arguments
.recipe_object
–.data
– The full data set that is used in the original recipe object passed into .recipe_object in order to obtain the baked data of the transform..threshold
– A number between 0 and 1. A fraction of the total variance that should be covered by the components..top_n
– How many variables loadings should be returned per PC
Value
A list object with several components.
< section id="details" class="level3">Details
This is a simple wrapper around some recipes functions to perform a PCA on a given recipe. This function will output a list and return it invisible. All of the components of the analysis will be returned in a list as their own object that can be selected individually. A scree plot is also included. The items that get returned are:
- pca_transform – This is the pca recipe.
- variable_loadings
- variable_variance
- pca_estimates
- pca_juiced_estimates
- pca_baked_data
- pca_variance_df
- pca_rotattion_df
- pca_variance_scree_plt
- pca_loadings_plt
- pca_loadings_plotly
- pca_top_n_loadings_plt
- pca_top_n_plotly
Working Example
library(rsample) library(recipes) splits <- initial_split(mtcars, prop = 0.8) rec_obj <- recipe(mpg ~ ., data = training(splits)) |> step_normalize(all_predictors()) pca_output <- pca_your_recipe( .recipe_object = rec_obj, .data = mtcars, .threshold = 0.75, .top_n = 5 )
Now let’s check the output:
pca_output
$pca_transform
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1 predictor: 10
── Operations
• Centering and scaling for: all_predictors()
• Centering for: recipes::all_numeric()
• Scaling for: recipes::all_numeric()
• Sparse, unbalanced variable filter on: recipes::all_numeric()
• PCA extraction with: recipes::all_numeric_predictors()
$variable_loadings # A tibble: 100 × 4 terms value component id <chr> <dbl> <chr> <chr> 1 cyl -0.394 PC1 pca_RSbN6 2 disp -0.389 PC1 pca_RSbN6 3 hp -0.356 PC1 pca_RSbN6 4 drat 0.321 PC1 pca_RSbN6 5 wt -0.358 PC1 pca_RSbN6 6 qsec 0.248 PC1 pca_RSbN6 7 vs 0.319 PC1 pca_RSbN6 8 am 0.248 PC1 pca_RSbN6 9 gear 0.238 PC1 pca_RSbN6 10 carb -0.232 PC1 pca_RSbN6 # ℹ 90 more rows $variable_variance # A tibble: 40 × 4 terms value component id <chr> <dbl> <int> <chr> 1 variance 6.09 1 pca_RSbN6 2 variance 2.42 2 pca_RSbN6 3 variance 0.619 3 pca_RSbN6 4 variance 0.231 4 pca_RSbN6 5 variance 0.215 5 pca_RSbN6 6 variance 0.171 6 pca_RSbN6 7 variance 0.112 7 pca_RSbN6 8 variance 0.0848 8 pca_RSbN6 9 variance 0.0409 9 pca_RSbN6 10 variance 0.0219 10 pca_RSbN6 # ℹ 30 more rows $pca_estimates
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs
Number of variables by role
outcome: 1 predictor: 10
── Training information
Training data contained 25 data points and no incomplete rows.
── Operations
• Centering and scaling for: cyl, disp, hp, drat, wt, qsec, ... | Trained
• Centering for: cyl, disp, hp, drat, wt, qsec, vs, am, gear, ... | Trained
• Scaling for: cyl, disp, hp, drat, wt, qsec, vs, am, gear, ... | Trained
• Sparse, unbalanced variable filter removed: <none> | Trained
• PCA extraction with: cyl, disp, hp, drat, wt, qsec, vs, am, ... | Trained
$pca_juiced_estimates # A tibble: 25 × 3 mpg PC1 PC2 <dbl> <dbl> <dbl> 1 -1.67 -3.54 -0.529 2 0.0945 0.633 2.03 3 0.394 2.31 -1.60 4 0.178 1.88 -1.88 5 1.99 3.29 0.00164 6 1.66 3.79 0.988 7 -0.670 -2.14 -0.503 8 -0.953 -3.45 -0.248 9 2.24 3.59 0.0209 10 0.161 2.47 0.534 # ℹ 15 more rows $pca_baked_data # A tibble: 32 × 3 mpg PC1 PC2 <dbl> <dbl> <dbl> 1 0.0945 0.633 2.03 2 0.0945 0.613 1.87 3 0.394 2.76 0.137 4 0.161 0.228 -2.17 5 -0.288 -2.01 -0.623 6 -0.388 0.191 -2.55 7 -1.02 -2.82 0.438 8 0.660 1.91 -1.13 9 0.394 2.31 -1.60 10 -0.205 0.622 0.125 # ℹ 22 more rows $pca_variance_df # A tibble: 10 × 6 PC var_explained var_pct_txt cum_var_pct cum_var_pct_txt ou_threshold <chr> <dbl> <chr> <dbl> <chr> <fct> 1 PC1 0.609 60.86% 0.609 60.86% Under 2 PC2 0.242 24.19% 0.850 85.05% Over 3 PC3 0.0619 6.19% 0.912 91.24% Over 4 PC4 0.0231 2.31% 0.935 93.55% Over 5 PC5 0.0215 2.15% 0.957 95.70% Over 6 PC6 0.0171 1.71% 0.974 97.41% Over 7 PC7 0.0112 1.12% 0.985 98.52% Over 8 PC8 0.00848 0.85% 0.994 99.37% Over 9 PC9 0.00409 0.41% 0.998 99.78% Over 10 PC10 0.00219 0.22% 1 100.00% Over $pca_rotation_df # A tibble: 10 × 10 PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 -0.394 0.0328 -0.146 0.0804 -0.162 0.0655 -0.223 -0.0200 -0.728 2 -0.389 -0.0623 0.0134 -0.0763 0.200 -0.383 0.447 0.0126 -0.369 3 -0.356 0.213 0.240 0.342 -0.195 -0.145 0.270 0.614 0.282 4 0.321 0.295 0.0556 0.336 0.768 -0.125 -0.0333 0.127 -0.204 5 -0.358 -0.124 0.391 -0.367 0.305 -0.347 -0.105 -0.338 0.262 6 0.248 -0.432 0.425 -0.317 0.00567 -0.0377 -0.272 0.568 -0.259 7 0.319 -0.255 0.421 0.505 -0.328 -0.313 0.156 -0.373 -0.163 8 0.248 0.443 -0.217 -0.241 -0.294 -0.693 -0.253 0.0794 -0.0246 9 0.238 0.449 0.346 -0.431 -0.125 0.278 0.509 -0.0781 -0.222 10 -0.232 0.445 0.490 0.151 -0.0541 0.189 -0.494 -0.133 -0.0201 # ℹ 1 more variable: PC10 <dbl> $pca_variance_scree_plt
$pca_loadings_plt
$pca_loadings_plotly $pca_top_n_loadings_plt
$pca_top_n_plotly
Pretty easy as you can see.
< section id="example---histogram-facet-plot" class="level2">Example – Histogram Facet Plot
< section id="syntax-1" class="level3">Syntax
hai_histogram_facet_plot( .data, .bins = 10, .scale_data = FALSE, .ncol = 5, .fct_reorder = FALSE, .fct_rev = FALSE, .fill = "steelblue", .color = "white", .scale = "free", .interactive = FALSE )< section id="arguments-1" class="level3">
Arguments
.data
– The data you want to pass to the function..bins
– The number of bins for the histograms..scale_data
– This is a boolean set to FALSE. TRUE will use hai_scale_zero_one_vec() to [0, 1] scale the data..ncol
– The number of columns for the facet_warp argument..fct_reorder
– Should the factor column be reordered? TRUE/FALSE, default of FALSE.fct_rev
– Should the factor column be reversed? TRUE/FALSE, default of FALSE.fill
– Default is steelblue.color
– Default is ‘white’.scale
– Default is ‘free’.interactive
– Default is FALSE, TRUE will produce a plotly plot.
Working Example
hai_histogram_facet_plot(mtcars, .interactive = FALSE)
hai_histogram_facet_plot(mtcars, .interactive = FALSE, .scale_data = TRUE)
Example – Boilerplacte Funcationality
Now we are going to go over some simple boilerplate funcationality. I call it boilerplate because you don’t have to change anything if you dont want to. For the boilerplate function there is a corresponding data preprocessor that will get the data into the shape it needs to be in for the algorithm. Let’s take a look.
< section id="working-example-2" class="level3">Working Example
First lets look at the data, then we will look at it after the preprocessor.
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa
rec_obj <- hai_earth_data_prepper(iris, Species ~ .) rec_obj
Now to run it through the boilerplate:
auto_earth <- hai_auto_earth( .data = iris, .rec_obj = rec_obj, .best_metric = "f_meas", .model_type = "classification" )
Now let’s inspect the output:
names(auto_earth)
[1] "recipe_info" "model_info" "tuned_info"
Recipe Information
auto_earth[["recipe_info"]]
Model Information
auto_earth[["model_info"]]
$model_spec MARS Model Specification (classification) Main Arguments: num_terms = tune::tune() prod_degree = tune::tune() prune_method = none Computational engine: earth $wflw ══ Workflow ════════════════════════════════════════════════════════════════════ Preprocessor: Recipe Model: mars() ── Preprocessor ──────────────────────────────────────────────────────────────── 4 Recipe Steps • step_string2factor() • step_novel() • step_dummy() • step_zv() ── Model ─────────────────────────────────────────────────────────────────────── MARS Model Specification (classification) Main Arguments: num_terms = tune::tune() prod_degree = tune::tune() prune_method = none Computational engine: earth $fitted_wflw ══ Workflow [trained] ══════════════════════════════════════════════════════════ Preprocessor: Recipe Model: mars() ── Preprocessor ──────────────────────────────────────────────────────────────── 4 Recipe Steps • step_string2factor() • step_novel() • step_dummy() • step_zv() ── Model ─────────────────────────────────────────────────────────────────────── GLM (family binomial, link logit): nulldev df dev df devratio AIC iters converged setosa 144.779 111 52.6908 110 0.6360 56.69 22 1 versicolor 137.505 111 125.5536 110 0.0869 129.60 4 1 virginica 144.779 111 15.1575 110 0.8950 19.16 9 1 Earth selected 2 of 15 terms, and 1 of 4 predictors (pmethod="none") (nprune=2) Termination condition: Reached nk 21 Importance: Petal.Length-unused, Sepal.Length-unused, Sepal.Width-unused, ... Number of terms at each degree of interaction: 1 1 (additive model) Earth GCV RSS GRSq RSq setosa 0.15145196 16.066078 0.34455933 0.36796602 versicolor 0.20252995 21.484449 0.05906052 0.09266277 virginica 0.04535734 4.811523 0.80370644 0.81071635 All 0.36072282 38.265605 0.46747354 0.48649080 $was_tuned [1] "tuned"
Tuned Information
auto_earth[["tuned_info"]]
$tuning_grid # A tibble: 7 × 2 num_terms prod_degree <int> <int> 1 3 2 2 4 1 3 5 2 4 3 1 5 4 2 6 2 2 7 2 1 $cv_obj # Monte Carlo cross-validation (0.75/0.25) with 25 resamples # A tibble: 25 × 2 splits id <list> <chr> 1 <split [84/28]> Resample01 2 <split [84/28]> Resample02 3 <split [84/28]> Resample03 4 <split [84/28]> Resample04 5 <split [84/28]> Resample05 6 <split [84/28]> Resample06 7 <split [84/28]> Resample07 8 <split [84/28]> Resample08 9 <split [84/28]> Resample09 10 <split [84/28]> Resample10 # ℹ 15 more rows $tuned_results # Tuning results # Monte Carlo cross-validation (0.75/0.25) with 25 resamples # A tibble: 25 × 4 splits id .metrics .notes <list> <chr> <list> <list> 1 <split [84/28]> Resample01 <tibble [77 × 6]> <tibble [5 × 3]> 2 <split [84/28]> Resample02 <tibble [77 × 6]> <tibble [5 × 3]> 3 <split [84/28]> Resample03 <tibble [77 × 6]> <tibble [5 × 3]> 4 <split [84/28]> Resample04 <tibble [77 × 6]> <tibble [5 × 3]> 5 <split [84/28]> Resample05 <tibble [77 × 6]> <tibble [5 × 3]> 6 <split [84/28]> Resample06 <tibble [77 × 6]> <tibble [5 × 3]> 7 <split [84/28]> Resample07 <tibble [77 × 6]> <tibble [5 × 3]> 8 <split [84/28]> Resample08 <tibble [77 × 6]> <tibble [5 × 3]> 9 <split [84/28]> Resample09 <tibble [77 × 6]> <tibble [5 × 3]> 10 <split [84/28]> Resample10 <tibble [77 × 6]> <tibble [5 × 3]> # ℹ 15 more rows There were issues with some computations: - Warning(s) x5: While computing multiclass `precision()`, some levels had no pred... - Warning(s) x6: While computing multiclass `precision()`, some levels had no pred... - Warning(s) x1: While computing multiclass `precision()`, some levels had no pred... - Warning(s) x1: While computing multiclass `precision()`, some levels had no pred... - Warning(s) x2: While computing multiclass `precision()`, some levels had no pred... - Warning(s) x1: While computing multiclass `precision()`, some levels had no pred... - Warning(s) x5: While computing multiclass `precision()`, some levels had no pred... - Warning(s) x4: While computing multiclass `precision()`, some levels had no pred... - Warning(s) x48: glm.fit: algorithm did not converge, glm.fit: fitted probabilitie... - Warning(s) x2: glm.fit: algorithm did not converge, glm.fit: fitted probabilitie... - Warning(s) x49: glm.fit: fitted probabilities numerically 0 or 1 occurred, glm.fi... - Warning(s) x1: glm.fit: fitted probabilities numerically 0 or 1 occurred, glm.fi... Run `show_notes(.Last.tune.result)` for more information. $grid_size [1] 10 $best_metric [1] "f_meas" $best_result_set # A tibble: 1 × 8 num_terms prod_degree .metric .estimator mean n std_err .config <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr> 1 2 1 f_meas macro 0.124 25 0.0154 Preprocessor1_Mo… $tuning_grid_plot
$plotly_grid_plot
Metric Sets
With this package there comes some metric set’s that can be computed using the yardstick
package. These are the metric sets that are available:
Classification
hai_default_classification_metric_set()
A metric set, consisting of: - `sensitivity()`, a class metric | direction: maximize - `specificity()`, a class metric | direction: maximize - `recall()`, a class metric | direction: maximize - `precision()`, a class metric | direction: maximize - `mcc()`, a class metric | direction: maximize - `accuracy()`, a class metric | direction: maximize - `f_meas()`, a class metric | direction: maximize - `kap()`, a class metric | direction: maximize - `ppv()`, a class metric | direction: maximize - `npv()`, a class metric | direction: maximize - `bal_accuracy()`, a class metric | direction: maximize
Regression
hai_default_regression_metric_set()
A metric set, consisting of: - `mae()`, a numeric metric | direction: minimize - `mape()`, a numeric metric | direction: minimize - `mase()`, a numeric metric | direction: minimize - `smape()`, a numeric metric | direction: minimize - `rmse()`, a numeric metric | direction: minimize - `rsq()`, a numeric metric | direction: maximize
Here is a list of the items currently on it as of writing this article:
- Plotting Functions – Functions for plotting.
- Clustering Functions – Functions for clustering and analysis.
- Boiler Plate Functions – Functions for automatic recipes, workflows, and tuned models.
- Dimensionality Reduction – Functions for dimension reduction.
- Data Wrangling – Functions for data wrangling.
- Data Preprocessors – Functions for data preprocessing.
- Recipe Steps – Functions to add recipe steps.
- Table Functions – Functions that return tibbles.
- Vectorized Functions – Vector functions.
- Augmenting Functions – Functions for data augmentation.
- Miscellaneous Functions – Miscellaneous utility functions.
- Metric Sets – Metric sets for evaluation.
For more detailed information, you can visit the healthyR.ai function reference page.
< section id="conclusion" class="level1">Conclusion
I hope this helped a bit with understanding the healthyR.ai
package. It is a very powerful package that can help you with a lot of different tasks. I will be writing more about this package in the future. If you have any questions or comments, please feel free to reach out to me at any of these:
People can get in touch with you through the following social profiles:
- LinkedIn: linkedin.com/in/spsanderson
- Mastodon: mstdn.social/@stevensanderson
- Telegram: t.me/steveondata
Happy coding!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.