Binning Data with rbin
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
We are happy to introduce the rbin package, a set of tools for binning/discretization of data, designed keeping in mind beginner/intermediate R users. It comes with two RStudio addins for interactive binning.
Installation
# Install release version from CRAN install.packages("rbin") # Install development version from GitHub # install.packages("devtools") devtools::install_github("rsquaredacademy/rbin")
RStudio Addins
rbin includes two RStudio addins for manually binning data. Below is a demo:
Read on to learn more about the features of rbin, or see the rbin website for detailed documentation on using the package.
Introduction
Binning is the process of transforming numerical or continuous data into categorical data. It is a common data pre-processing step of the model building process. rbin has the following features:
- manual binning using shiny app
- equal length binning method
- winsorized binning method
- quantile binning method
- combine levels of categorical data
- create dummy variables based on binning method
- calculates weight of evidence (WOE), entropy and information value (IV)
- provides summary information about binning process
Manual Binning
For manual binning, you need to specify the cut points for the bins. rbin
follows the left closed and right open interval ([0,1) = {x | 0 ≤ x < 1}
) for
creating bins. The number of cut points you specify is one less than the number
of bins you want to create i.e. if you want to create 10 bins, you need to
specify only 9 cut points as shown in the below example. The accompanying
RStudio addin, rbinAddin()
can be used to iteratively bin the data and to
enforce monotonic increasing/decreasing trend.
After finalizing the bins, you can use rbin_create()
to create the dummy
variables.
Bins
bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56)) bins ## Binning Summary ## --------------------------- ## Method Manual ## Response y ## Predictor age ## Bins 10 ## Count 4521 ## Goods 517 ## Bads 4004 ## Entropy 0.5 ## Information Value 0.12 ## ## ## # A tibble: 10 x 7 ## cut_point bin_count good bad woe iv entropy ## <chr> <int> <int> <int> <dbl> <dbl> <dbl> ## 1 < 29 410 71 339 -0.484 0.0255 0.665 ## 2 < 31 313 41 272 -0.155 0.00176 0.560 ## 3 < 34 567 55 512 0.184 0.00395 0.459 ## 4 < 36 396 45 351 0.00712 0.00000443 0.511 ## 5 < 39 519 47 472 0.260 0.00701 0.438 ## 6 < 42 431 33 398 0.443 0.0158 0.390 ## 7 < 46 449 47 402 0.0993 0.000942 0.484 ## 8 < 51 521 40 481 0.440 0.0188 0.391 ## 9 < 56 445 49 396 0.0426 0.000176 0.500 ## 10 >= 56 470 89 381 -0.593 0.0456 0.700
Plot
plot(bins)
Dummy Variables
bins <- rbin_manual(mbank, y, age, c(29, 31, 34, 36, 39, 42, 46, 51, 56)) rbin_create(mbank, age, bins) ## # A tibble: 4,521 x 26 ## age job marital education default balance housing loan ## <int> <fct> <fct> <fct> <fct> <dbl> <fct> <fct> ## 1 34 technician married tertiary no 297 yes no ## 2 49 services married secondary no 180 yes yes ## 3 38 admin. single secondary no 262 no no ## 4 47 services married secondary no 367 yes no ## 5 51 self-employed single secondary no 1640 yes no ## 6 40 unemployed married secondary no 3382 yes no ## 7 58 retired married secondary no 1227 no no ## 8 32 unemployed married primary no 309 yes no ## 9 46 blue-collar married secondary no 922 yes no ## 10 32 services married tertiary no 0 no no ## contact day month duration campaign pdays previous poutcome y ## <fct> <int> <fct> <int> <int> <int> <int> <fct> <fct> ## 1 cellular 29 jan 375 2 -1 0 unknown 0 ## 2 unknown 2 jun 392 3 -1 0 unknown 0 ## 3 cellular 3 feb 315 2 180 6 failure 1 ## 4 cellular 12 may 309 1 306 4 success 1 ## 5 unknown 15 may 67 4 -1 0 unknown 0 ## 6 unknown 14 may 125 1 -1 0 unknown 0 ## 7 cellular 14 aug 182 2 37 2 failure 0 ## 8 telephone 13 may 185 1 370 3 failure 0 ## 9 telephone 18 nov 296 2 -1 0 unknown 0 ## 10 cellular 21 nov 80 1 -1 0 unknown 0 ## `age_<_31` `age_<_34` `age_<_36` `age_<_39` `age_<_42` `age_<_46` ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0 0 1 0 0 0 ## 2 0 0 0 0 0 0 ## 3 0 0 0 1 0 0 ## 4 0 0 0 0 0 0 ## 5 0 0 0 0 0 0 ## 6 0 0 0 0 1 0 ## 7 0 0 0 0 0 0 ## 8 0 1 0 0 0 0 ## 9 0 0 0 0 0 0 ## 10 0 1 0 0 0 0 ## `age_<_51` `age_<_56` `age_>=_56` ## <dbl> <dbl> <dbl> ## 1 0 0 0 ## 2 1 0 0 ## 3 0 0 0 ## 4 1 0 0 ## 5 0 1 0 ## 6 0 0 0 ## 7 0 0 1 ## 8 0 0 0 ## 9 1 0 0 ## 10 0 0 0 ## # ... with 4,511 more rows
Factor Binning
You can collapse or combine levels of a factor/categorical variable using
rbin_factor_combine()
and then use rbin_factor()
to look at weight of
evidence, entropy and information value. After finalizing the bins, you can
use rbin_factor_create()
to create the dummy variables. You can use the
RStudio addin, rbinFactorAddin()
to interactively combine the levels and
create dummy variables after finalizing the bins.
Combine Levels
upper <- c("secondary", "tertiary") out <- rbin_factor_combine(mbank, education, upper, "upper") table(out$education) ## ## primary unknown upper ## 691 179 3651 out <- rbin_factor_combine(mbank, education, c("secondary", "tertiary"), "upper") table(out$education) ## ## primary unknown upper ## 691 179 3651
Bins
bins <- rbin_factor(mbank, y, education) bins ## Binning Summary ## --------------------------- ## Method Custom ## Response y ## Predictor education ## Levels 4 ## Count 4521 ## Goods 517 ## Bads 4004 ## Entropy 0.51 ## Information Value 0.05 ## ## ## # A tibble: 4 x 7 ## level bin_count good bad woe iv entropy ## <fct> <int> <int> <int> <dbl> <dbl> <dbl> ## 1 tertiary 1299 195 1104 -0.313 0.0318 0.610 ## 2 secondary 2352 231 2121 0.170 0.0141 0.463 ## 3 unknown 179 25 154 -0.229 0.00227 0.583 ## 4 primary 691 66 625 0.201 0.00572 0.455
Plot
plot(bins)
Create Bins
upper <- c("secondary", "tertiary") out <- rbin_factor_combine(mbank, education, upper, "upper") rbin_factor_create(out, education) ## # A tibble: 4,521 x 19 ## age job marital default balance housing loan contact ## <int> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> ## 1 34 technician married no 297 yes no cellular ## 2 49 services married no 180 yes yes unknown ## 3 38 admin. single no 262 no no cellular ## 4 47 services married no 367 yes no cellular ## 5 51 self-employed single no 1640 yes no unknown ## 6 40 unemployed married no 3382 yes no unknown ## 7 58 retired married no 1227 no no cellular ## 8 32 unemployed married no 309 yes no telephone ## 9 46 blue-collar married no 922 yes no telephone ## 10 32 services married no 0 no no cellular ## day month duration campaign pdays previous poutcome y education ## <int> <fct> <int> <int> <int> <int> <fct> <fct> <fct> ## 1 29 jan 375 2 -1 0 unknown 0 upper ## 2 2 jun 392 3 -1 0 unknown 0 upper ## 3 3 feb 315 2 180 6 failure 1 upper ## 4 12 may 309 1 306 4 success 1 upper ## 5 15 may 67 4 -1 0 unknown 0 upper ## 6 14 may 125 1 -1 0 unknown 0 upper ## 7 14 aug 182 2 37 2 failure 0 upper ## 8 13 may 185 1 370 3 failure 0 primary ## 9 18 nov 296 2 -1 0 unknown 0 upper ## 10 21 nov 80 1 -1 0 unknown 0 upper ## education_unknown education_upper ## <dbl> <dbl> ## 1 0 1 ## 2 0 1 ## 3 0 1 ## 4 0 1 ## 5 0 1 ## 6 0 1 ## 7 0 1 ## 8 0 0 ## 9 0 1 ## 10 0 1 ## # ... with 4,511 more rows
Quantile Binning
Quantile binning aims to bin the data into roughly equal groups using quantiles.
bins <- rbin_quantiles(mbank, y, age, 10) bins ## Binning Summary ## ----------------------------- ## Method Quantile ## Response y ## Predictor age ## Bins 10 ## Count 4521 ## Goods 517 ## Bads 4004 ## Entropy 0.5 ## Information Value 0.12 ## ## ## # A tibble: 10 x 7 ## cut_point bin_count good bad woe iv entropy ## <chr> <int> <int> <int> <dbl> <dbl> <dbl> ## 1 < 29 410 71 339 -0.484 0.0255 0.665 ## 2 < 31 313 41 272 -0.155 0.00176 0.560 ## 3 < 34 567 55 512 0.184 0.00395 0.459 ## 4 < 36 396 45 351 0.00712 0.00000443 0.511 ## 5 < 39 519 47 472 0.260 0.00701 0.438 ## 6 < 42 431 33 398 0.443 0.0158 0.390 ## 7 < 46 449 47 402 0.0993 0.000942 0.484 ## 8 < 51 521 40 481 0.440 0.0188 0.391 ## 9 < 56 445 49 396 0.0426 0.000176 0.500 ## 10 >= 56 470 89 381 -0.593 0.0456 0.700
Plot
plot(bins)
Winsorized Binning
Winsorized binning is similar to equal length binning except that both tails are cut off to obtain a smooth binning result. This technique is often used to remove outliers during the data pre-processing stage. For Winsorized binning, the Winsorized statistics are computed first. After the minimum and maximum have been found, the split points are calculated the same way as in equal length binning.
bins <- rbin_winsorize(mbank, y, age, 10, winsor_rate = 0.05) bins ## Binning Summary ## ------------------------------ ## Method Winsorize ## Response y ## Predictor age ## Bins 10 ## Count 4521 ## Goods 517 ## Bads 4004 ## Entropy 0.51 ## Information Value 0.1 ## ## ## # A tibble: 10 x 7 ## cut_point bin_count good bad woe iv entropy ## <chr> <int> <int> <int> <dbl> <dbl> <dbl> ## 1 < 30.2 723 112 611 -0.350 0.0224 0.622 ## 2 < 33.4 567 55 512 0.184 0.00395 0.459 ## 3 < 36.6 573 58 515 0.137 0.00225 0.473 ## 4 < 39.8 497 44 453 0.285 0.00798 0.432 ## 5 < 43 396 37 359 0.225 0.00408 0.448 ## 6 < 46.2 461 43 418 0.227 0.00482 0.447 ## 7 < 49.4 281 22 259 0.419 0.00927 0.396 ## 8 < 52.6 309 32 277 0.111 0.000811 0.480 ## 9 < 55.8 244 25 219 0.123 0.000781 0.477 ## 10 >= 55.8 470 89 381 -0.593 0.0456 0.700
Plot
plot(bins)
Learning More
The rbin website includes comprehensive documentation on using the package, including the following article that gives a brief introduction to rbin:
Feedback
rbin has been on CRAN for a few months now while we were fixing bugs and making the API stable. All feedback is welcome. Issues (bugs and feature requests) can be posted to github tracker. For help with code or other related questions, feel free to reach out to us at [email protected].
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.