Modeling Multi-Category Outcomes With vtreat
vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features, including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!).
In addition, vtreat can now effectively prepare data for multi-class classification or multinomial modeling.
The two functions needed (mkCrossFrameMExperiment() and the S3 method prepare.multinomial_plan()) are now part of vtreat.
Let’s work a specific example: trying to model the multi-class outcome y as a function of x1, x2, and x3.
library("vtreat")
# create example data
set.seed(326346)
sym_bonuses <- rnorm(3)
names(sym_bonuses) <- c("a", "b", "c")
sym_bonuses3 <- rnorm(3)
names(sym_bonuses3) <- as.character(seq_len(length(sym_bonuses3)))
n_row <- 1000
d <- data.frame(
  x1 = rnorm(n_row),
  x2 = sample(names(sym_bonuses), n_row, replace = TRUE),
  x3 = sample(names(sym_bonuses3), n_row, replace = TRUE),
  y = "NoInfo",
  stringsAsFactors = FALSE)
d$y[sym_bonuses[d$x2] >
      pmax(d$x1, sym_bonuses3[d$x3], runif(n_row))] <- "Large1"
d$y[sym_bonuses3[d$x3] >
      pmax(sym_bonuses[d$x2], d$x1, runif(n_row))] <- "Large2"
knitr::kable(head(d))
x1 | x2 | x3 | y |
---|---|---|---|
0.8178292 | b | 3 | Large2 |
0.5867139 | b | 3 | Large2 |
-0.6711920 | a | 3 | Large2 |
0.1033166 | c | 2 | NoInfo |
-0.3182176 | b | 1 | NoInfo |
-0.5914308 | c | 2 | NoInfo |
We define the problem controls and use mkCrossFrameMExperiment()
to build both a cross-frame and a treatment plan.
# define problem
vars <- c("x1", "x2", "x3")
y_name <- "y"

# build the multi-class cross frame and treatments
cfe_m <- mkCrossFrameMExperiment(d, vars, y_name)
The cross-frame is the entity safest to train on (unless you have made a separate data split for the treatment design step). It uses cross-validation to reduce nested model bias. Some notes on this issue are available here and here.
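As a sketch of the separate-split alternative mentioned above (the split fraction, seed, and new variable names below are our own choices for illustration; we re-use the d, vars, and y_name already defined in this example), one could design the treatments on one portion of the data and prepare the other portion for training:

# sketch only: design treatments on a held-out portion, train on the rest
set.seed(2019)                                # arbitrary seed for illustration
is_design <- runif(nrow(d)) <= 0.5            # arbitrary 50/50 split
d_design <- d[is_design, , drop = FALSE]      # used only to design treatments
d_train  <- d[!is_design, , drop = FALSE]     # used to fit the downstream model
cfe_split <- mkCrossFrameMExperiment(d_design, vars, y_name)
train_treated <- prepare(cfe_split$treat_m, d_train)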
# look at the data we would train models on
str(cfe_m$cross_frame)
## 'data.frame':    1000 obs. of  16 variables:
##  $ x1_clean      : num  0.818 0.587 -0.671 0.103 -0.318 ...
##  $ x2_catP       : num  0.313 0.313 0.325 0.362 0.313 0.362 0.362 0.325 0.313 0.325 ...
##  $ x3_catP       : num  0.333 0.333 0.333 0.347 0.32 0.347 0.333 0.347 0.333 0.347 ...
##  $ x2_lev_x_a    : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ x2_lev_x_b    : num  1 1 0 0 1 0 0 0 1 0 ...
##  $ x2_lev_x_c    : num  0 0 0 1 0 1 1 0 0 0 ...
##  $ x3_lev_x_1    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 1 0 1 ...
##  $ x3_lev_x_3    : num  1 1 1 0 0 0 1 0 1 0 ...
##  $ Large1_x2_catB: num  -11.23 -11.2 1.25 -11.41 -11.27 ...
##  $ Large1_x3_catB: num  -11.356 -11.239 -11.239 0.379 0.431 ...
##  $ Large2_x2_catB: num  0.0862 0.1446 -0.0243 -0.1268 0.0862 ...
##  $ Large2_x3_catB: num  4.98 6.09 4.69 -3.11 -13.86 ...
##  $ NoInfo_x2_catB: num  -0.0537 0.1084 -0.2827 0.2859 0.1084 ...
##  $ NoInfo_x3_catB: num  -4.82 -5.24 -4.83 2.13 2.53 ...
##  $ y             : chr  "Large2" "Large2" "Large2" "NoInfo" ...
prepare()
can apply the designed treatments to new data. Here we are simulating new data by re-using our design data.
# pretend original data is new data to be treated
# NA out top row to show processing
for(vi in vars) {
  d[[vi]][[1]] <- NA
}
str(prepare(cfe_m$treat_m, d))
## 'data.frame':    1000 obs. of  16 variables:
##  $ x1_clean      : num  0.0205 0.5867 -0.6712 0.1033 -0.3182 ...
##  $ x2_catP       : num  0.0005 0.313 0.325 0.362 0.313 0.362 0.362 0.325 0.313 0.325 ...
##  $ x3_catP       : num  0.0005 0.333 0.333 0.347 0.32 0.347 0.333 0.347 0.333 0.347 ...
##  $ x2_lev_x_a    : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ x2_lev_x_b    : num  0 1 0 0 1 0 0 0 1 0 ...
##  $ x2_lev_x_c    : num  0 0 0 1 0 1 1 0 0 0 ...
##  $ x3_lev_x_1    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ x3_lev_x_2    : num  0 0 0 1 0 1 0 1 0 1 ...
##  $ x3_lev_x_3    : num  0 1 1 0 0 0 1 0 1 0 ...
##  $ Large1_x2_catB: num  0 -11.6 1.2 -11.8 -11.6 ...
##  $ Large1_x3_catB: num  0 -11.702 -11.702 0.411 0.436 ...
##  $ Large2_x2_catB: num  0 0.133 -0.0215 -0.0999 0.133 ...
##  $ Large2_x3_catB: num  0 5.1 5.1 -3.54 -14.29 ...
##  $ NoInfo_x2_catB: num  0 0.0206 -0.2829 0.2536 0.0206 ...
##  $ NoInfo_x3_catB: num  0 -4.95 -4.95 2.11 2.34 ...
##  $ y             : chr  "Large2" "Large2" "Large2" "NoInfo" ...
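For instance (this is our own illustration, not part of the original example, and it assumes the nnet package is installed), one could fit a multinomial model on the cross-frame and then score the separately prepared data:

# sketch only: fit a multinomial model on the cross-frame
library(nnet)
model_vars <- setdiff(colnames(cfe_m$cross_frame), y_name)
f <- reformulate(model_vars, response = y_name)
model <- nnet::multinom(f, data = cfe_m$cross_frame, trace = FALSE)

# score the data prepared by prepare() above
d_treated <- prepare(cfe_m$treat_m, d)
d_treated$pred <- predict(model, newdata = d_treated)
table(truth = d_treated$y, prediction = d_treated$pred)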
We can easily estimate variable importance, both per outcome level and per variable.
knitr::kable(
  cfe_m$score_frame[, c("varName", "rsq", "sig", "outcome_level"), drop = FALSE])
varName | rsq | sig | outcome_level |
---|---|---|---|
x1_clean | 0.0558908 | 0.0000382 | Large1 |
x2_catP | 0.0275238 | 0.0038536 | Large1 |
x2_lev_x_a | 0.2680953 | 0.0000000 | Large1 |
x2_lev_x_b | 0.0885021 | 0.0000002 | Large1 |
x2_lev_x_c | 0.1060407 | 0.0000000 | Large1 |
x3_catP | 0.0000346 | 0.9183445 | Large1 |
x3_lev_x_1 | 0.0141504 | 0.0382554 | Large1 |
x3_lev_x_2 | 0.0140364 | 0.0390420 | Large1 |
x3_lev_x_3 | 0.0955004 | 0.0000001 | Large1 |
x1_clean | 0.0015382 | 0.1615618 | Large2 |
x2_catP | 0.0013055 | 0.1971725 | Large2 |
x2_lev_x_a | 0.0000387 | 0.8242956 | Large2 |
x2_lev_x_b | 0.0014571 | 0.1730603 | Large2 |
x2_lev_x_c | 0.0009604 | 0.2686774 | Large2 |
x3_catP | 0.0007725 | 0.3211959 | Large2 |
x3_lev_x_1 | 0.2602002 | 0.0000000 | Large2 |
x3_lev_x_2 | 0.2483708 | 0.0000000 | Large2 |
x3_lev_x_3 | 0.9197595 | 0.0000000 | Large2 |
x1_clean | 0.0064771 | 0.0034947 | NoInfo |
x2_catP | 0.0040540 | 0.0208595 | NoInfo |
x2_lev_x_a | 0.0071709 | 0.0021196 | NoInfo |
x2_lev_x_b | 0.0000340 | 0.8323647 | NoInfo |
x2_lev_x_c | 0.0060493 | 0.0047665 | NoInfo |
x3_catP | 0.0006576 | 0.3520950 | NoInfo |
x3_lev_x_1 | 0.1838759 | 0.0000000 | NoInfo |
x3_lev_x_2 | 0.1857824 | 0.0000000 | NoInfo |
x3_lev_x_3 | 0.7372570 | 0.0000000 | NoInfo |
Large1_x2_catB | 0.2675964 | 0.0000000 | Large1 |
Large1_x3_catB | 0.0946910 | 0.0000001 | Large1 |
Large2_x2_catB | 0.0000291 | 0.8472707 | Large2 |
Large2_x3_catB | 0.9239860 | 0.0000000 | Large2 |
NoInfo_x2_catB | 0.0068238 | 0.0027207 | NoInfo |
NoInfo_x3_catB | 0.7326682 | 0.0000000 | NoInfo |
One can relate these per-target and per-treatment performances back to the original columns by aggregating.
tapply(cfe_m$score_frame$rsq, cfe_m$score_frame$origName, max)
##         x1         x2         x3 
## 0.05589076 0.26809534 0.92398602
tapply(cfe_m$score_frame$sig, cfe_m$score_frame$origName, min)
##            x1            x2            x3 
##  3.819834e-05  1.892838e-19 5.746904e-258
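One could also use the score_frame to prune derived variables before modeling. The following is a sketch of our own (the significance threshold is an arbitrary choice), keeping only variables that look significant for at least one outcome level:

# sketch only: filter derived variables using the score frame
sf <- cfe_m$score_frame
sig_threshold <- 1 / nrow(sf)                 # ad hoc threshold for illustration
good_vars <- unique(sf$varName[sf$sig < sig_threshold])
train_frame <- cfe_m$cross_frame[, c(good_vars, y_name), drop = FALSE]
str(train_frame, list.len = 5)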
Obvious remaining issues include computing overall variable importance, and the blow-up and co-dependency of the produced columns. These we leave for the next modeling step to deal with (this is our philosophy with most issues that involve joint distributions of variables).
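As one example of leaving such issues to the modeling step (again a sketch of our own, assuming the glmnet package is installed), a regularized multinomial fit is a standard way to tolerate correlated and redundant derived columns:

# sketch only: regularization tolerates the correlated/expanded columns
library(glmnet)
x_mat <- as.matrix(cfe_m$cross_frame[, setdiff(colnames(cfe_m$cross_frame), y_name)])
y_vec <- as.factor(cfe_m$cross_frame[[y_name]])
cv_fit <- cv.glmnet(x_mat, y_vec, family = "multinomial")
# coefficients per outcome level; redundant columns are shrunk toward zero
coef(cv_fit, s = "lambda.min")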