Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Problem Setup
In recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. Now first, we must define what we mean by “normalize” a matrix/data.frame.
One way to standardize/normalize a row is to subtract by the mean and divide by the max to put the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by doing:
(x – μ)/σ
where μ/σ are the mean/standard deviation of the column (typically the column).
The answer on that forum eventually came down to using the scale
command. The scale
function is great. It is simple and has 2 options other than passing in the matrix:
- Center – should the columns have their mean subracted off?
- Scale – should the columns have their standard deviation divided after/not centering?
Overall, the function is fast, but it can always be faster.
The matrixStats
package has some very quick operations for row/column operations and computing statistics along these margins.
Creating Some Data
Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.
set.seed(1) mat = matrix(rnorm(1e7, mean = 14, sd = 5), nrow = 1e5) dim(mat)
How fast is scale?
Let’s see how fast the scale
function is:
system.time({ sx = scale(mat) }) user system elapsed 0.971 0.285 1.262
That’s pretty fast! Overall, there is not much room for improvement in this case, but it may be relevant if you have a lot of matrices or ones bigger than the one defined here in mat
.
Defining a new function
First, we must load in the matrixStats
package and the only function we really are using is the colSds
.
library(matrixStats) colScale = function(x, center = TRUE, scale = TRUE, add_attr = TRUE, rows = NULL, cols = NULL) { if (!is.null(rows) && !is.null(cols)) { x <- x[rows, cols, drop = FALSE] } else if (!is.null(rows)) { x <- x[rows, , drop = FALSE] } else if (!is.null(cols)) { x <- x[, cols, drop = FALSE] } ################ # Get the column means ################ cm = colMeans(x, na.rm = TRUE) ################ # Get the column sd ################ if (scale) { csd = colSds(x, center = cm) } else { # just divide by 1 if not csd = rep(1, length = length(cm)) } if (!center) { # just subtract 0 cm = rep(0, length = length(cm)) } x = t( (t(x) - cm) / csd ) if (add_attr) { if (center) { attr(x, "scaled:center") <- cm } if (scale) { attr(x, "scaled:scale") <- csd } } return(x) }
Let’s break down what the function is doing:
- The function takes in a matrix
x
with options:- subsetting
rows
orcols
- center each column (by the mean) or not
- scale each column (by the standard deviation) or not
- Add the attributes of center/scale, so they match the
scale
output.
- subsetting
- The functions subsets the matrix if options are passed.
- Column means are calculated
- Column standard deviations are calculated (using the colmeans) if
scale = TRUE
or simply set to 1 ifscale = FALSE
. - If the data is not to be centered, the centers are set to 0.
- The data is transposed and the mean is subtracted then the result is divded by the standard deviation. The data is transposed back.
- The reason this is done is because
R
operates column-wise. Letp
be the number of columns. The column means/sds are of lengthp
. If one simply subtracted the column means,R
would try to do this to each individual column. For instance, it would recycle thep
numbers to get to lengthn
(number of rows), and do that subtraction, which is not what we want.
- The reason this is done is because
- The attributes are added to the matrix to match
scale
output.
colScale timing
Now we can see how fast the colScale
command would take:
system.time({ csx = colScale(mat) }) user system elapsed 0.231 0.039 0.271
This is a lot faster than the scale
function. First and foremost, let us make sure that these give the same results:
all.equal(sx, csx) [1] TRUE
Better benchmarking
OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the microbenchmark
package to
library(microbenchmark) mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s") print(mb) Unit: seconds expr min lq mean median uq max colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507 scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223 neval cld 20 a 20 b
We can visualize the results using ggplot2
and some violin plots.
library(ggplot2) g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) + geom_violin() + theme_grey(base_size = 20) + xlab("Method") + ylab("Time (seconds)") print(g)
What about scaling rows!
If you note above, we did not standardize the matrix with respect to the rows, but rather the columns. We can perform this simply by transposing the matrix, running scale and then transposing the matrix back:
system.time({ scaled_row = t( scale(t(mat)) ) }) user system elapsed 2.165 0.624 3.398 all(abs(rowMeans(scaled_row)) < 1e-15) [1] TRUE
Again, we can do the same thing with colScale
:
system.time({ colscaled_row = t( colScale(t(mat)) ) }) user system elapsed 0.426 0.097 0.542 all(abs(rowMeans(colscaled_row)) < 1e-15) [1] TRUE all.equal(colscaled_row, scaled_row) [1] TRUE
And we see the results are identical
Creating rowScale
The above results are good for what we would like to do. We may want to define the rowScale
function (as below), where we do not have to do the transposing and transposing back, as this takes may take some extra time.
Again, if we’re about improving speed, this may help.
rowScale = function(x, center = TRUE, scale = TRUE, add_attr = TRUE, rows = NULL, cols = NULL) { if (!is.null(rows) && !is.null(cols)) { x <- x[rows, cols, drop = FALSE] } else if (!is.null(rows)) { x <- x[rows, , drop = FALSE] } else if (!is.null(cols)) { x <- x[, cols, drop = FALSE] } ################ # Get the column means ################ cm = rowMeans(x, na.rm = TRUE) ################ # Get the column sd ################ if (scale) { csd = rowSds(x, center = cm) } else { # just divide by 1 if not csd = rep(1, length = length(cm)) } if (!center) { # just subtract 0 cm = rep(0, length = length(cm)) } x = (x - cm) / csd if (add_attr) { if (center) { attr(x, "scaled:center") <- cm } if (scale) { attr(x, "scaled:scale") <- csd } } return(x) }
Now let’s see how we do with rowScale
:
system.time({ rowscaled_row = rowScale(mat) }) user system elapsed 0.174 0.016 0.206 all(abs(rowMeans(rowscaled_row)) < 1e-15) [1] TRUE all.equal(rowscaled_row, scaled_row) [1] TRUE
Let’s look at the times for this breakdown using microbenchmark
:
mb_row = microbenchmark(t( colScale(t(mat)) ), t( scale(t(mat)) ), rowScale(mat), times = 20, unit = "s") print(mb_row) Unit: seconds expr min lq mean median uq t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429 t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272 rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327 max neval cld 0.9008130 20 a 5.0518348 20 b 0.5138103 20 a
and visualize the results:
g %+% mb_row
Conclusions
Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The scale
function is well suited for this purpose, but the matrixStats
package allows for faster computation done in C. The scale
function will have different behavior as the code below from base::scale.default
:
f <- function(v) { v <- v[!is.na(v)] sqrt(sum(v^2)/max(1, length(v) - 1L)) } scale <- apply(x, 2L, f)
If the data is not centered and center = FALSE
, the data will be divided by the squared sum of each column (divided by n-1
). This may be the desired behavior, but the user may want to divide by the standard deviation and not this squared sum and colScale/rowScale
can do that if necessary. I will talk to Henrik Bengtsson (matrixStats
author/maintainer) about incorporating these into matrixStats
for general use. But for now, you can use the above code.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.