New activation functions in mlsauce’s LSBoost
In previous posts, I introduced LSBoost, a gradient boosting machine that uses randomized and penalized least squares as base learners, instead of the decision trees that are more commonly used. mlsauce’s LSBoost accounts for a problem’s nonlinearity by adding new, engineered explanatory variables \(g(XW+b)\), with:
- \(g\): an activation function (tanh, ReLU, sigmoid, …)
- \(X\): input data (covariates, explanatory variables)
- \(W\): a matrix containing numbers drawn from a multivariate uniform distribution on \([0, 1]\)
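As a rough illustration of the engineered features \(g(XW+b)\) described above, the base R sketch below builds them by hand. It is not mlsauce's internal code: the helper name `make_features`, the number of new features `n_hidden`, and the treatment of \(b\) as a bias vector of uniform draws are all assumptions made for this example.

```r
# Hypothetical helper (not part of mlsauce): builds g(XW + b) by hand
make_features <- function(X, n_hidden = 5, g = tanh, seed = 123) {
  set.seed(seed)
  p <- ncol(X)
  # W: p x n_hidden matrix of draws from a uniform distribution on [0, 1]
  W <- matrix(runif(p * n_hidden), nrow = p, ncol = n_hidden)
  # b: ASSUMED here to be a bias vector with one entry per new feature
  b <- runif(n_hidden)
  # sweep() adds b[j] to every entry of column j of XW
  Z <- sweep(X %*% W, 2, b, "+")
  g(Z)
}

# small built-in dataset, used for illustration only
X_demo <- scale(as.matrix(mtcars[, -1]))
H <- make_features(X_demo)
dim(H)  # one row per observation, one column per engineered feature
```

Each column of `H` is a nonlinear transformation of a random linear combination of the original covariates, which is the kind of engineered variable the base learners can then use.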
New activation functions were added to version 0.8.0 of mlsauce: ReLU6, tanh, sigmoid. These changes are available both in R and in the Python implementation of mlsauce.
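For reference, the activation names used in this post correspond to the standard definitions sketched below in base R. These are the textbook formulas, written for illustration only; mlsauce's own implementations may differ in detail, and the `activations` list is a name introduced for this sketch.

```r
# Textbook definitions, vectorised over numeric input (illustrative only)
activations <- list(
  relu    = function(x) pmax(x, 0),
  relu6   = function(x) pmin(pmax(x, 0), 6),  # ReLU capped at 6
  sigmoid = function(x) 1 / (1 + exp(-x)),
  tanh    = tanh
)

x <- c(-10, -1, 0, 1, 10)
# one column per activation, one row per element of x
sapply(activations, function(g) g(x))
```

ReLU6 and sigmoid are bounded (on \([0, 6]\) and \((0, 1)\) respectively), tanh is bounded on \((-1, 1)\) and centered at zero, and ReLU is unbounded above.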
The following R example compares the distributions of out-of-sample errors (RMSE over 100 repeated train/test splits) obtained with \(g\) = sigmoid and with \(g\) = tanh. Of course, LSBoost can be tuned further than what’s demonstrated here.
# Input data
X <- as.matrix(MASS::Boston[, -1])
y <- as.integer(MASS::Boston[, 1])

n <- dim(X)[1]
p <- dim(X)[2]

# number of repeats for obtaining the distribution of errors
n_repeats <- 100

# function for calculating the out-of-sample error, based on activation functions
get_rmse_error <- function(activation = c("sigmoid", "tanh", "relu6", "relu"))
{
  err <- rep(0, n_repeats)
  pb <- txtProgressBar(min = 0, max = n_repeats, style = 3)

  for (i in 1:n_repeats)
  {
    set.seed(21341+i*10)
    train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
    test_index <- -train_index

    X_train <- as.matrix(X[train_index, ])
    y_train <- as.double(y[train_index])
    X_test <- as.matrix(X[test_index, ])
    y_test <- as.double(y[test_index])

    # using default parameters
    obj <- mlsauce::LSBoostRegressor(verbose = FALSE,
                                     activation = match.arg(activation))
    obj$fit(X_train, y_train)

    err[i] <- sqrt(mean((obj$predict(X_test) - y_test)**2))
    setTxtProgressBar(pb, i)
  }

  return(err)
}

# test set error for g=sigmoid
(err1 <- get_rmse_error("sigmoid"))

# test set error for g=tanh
(err2 <- get_rmse_error("tanh"))

# distribution of test set error
par(mfrow=c(1, 2))
hist(err1, main = "distribution of test set error \n (activation = sigmoid)")
hist(err2, main = "distribution of test set error \n (activation = tanh)")
> print(sessionInfo())
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS

Matrix products: default
BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] MASS_7.3-53     compiler_4.0.3  Matrix_1.2-18   tools_4.0.3     rappdirs_0.3.3
 [6] Rcpp_1.0.6      reticulate_1.18 grid_4.0.3      jsonlite_1.7.2  mlsauce_0.8.0
[11] lattice_0.20-41