Normalising data within groups
Occasionally it proves useful to normalise data. By this I mean scaling it so that it lies between zero and one (which is what you get for non-negative data). Admittedly, most people frown on this, but there are papers out there that use the method*.
How do we go about this? It's a very simple formula to calculate:
y'[i] = y[i]/sqrt(sum(y^2))
So we square all of the ys, add them up and take the square root (call it the denominator). Then we divide each individual y value by the denominator.
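To make the arithmetic concrete, here is a minimal base R sketch of the formula above. The values in y are made up purely for illustration.

```r
# Worked example of y'[i] = y[i] / sqrt(sum(y^2)); the values of y are made up
y <- c(3, 4)

denominator <- sqrt(sum(y^2))   # sqrt(9 + 16) = 5
y_norm <- y / denominator       # each y divided by the denominator

print(y_norm)
print(sum(y_norm^2))            # the squared normalised values sum to 1
```

A handy check is that the squared normalised values always add up to one, whatever the original scale of y.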
In R this is simple – for instance decostand in the vegan package does exactly this (plus a whole heap of other standardisations).
But what I couldn't find was a function that takes it a step further and normalises within groups:
y'[ij] = y[ij]/sqrt(sum(y^2[j]))
The difference here is the js, of course, which index the groups. Or to go a step further still:
y'[ijk] = y[ijk]/sqrt(sum(y^2[jk]))
where the ks represent subgroups of j.
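The grouped versions of the formula can be sketched in base R with ave(), which applies a function within each level of one or more grouping factors. This is only an illustration of the formulas, not the function described in this post; the data frame d and the helper unit_norm are made up for the example.

```r
# Made-up example data: weights for two sexes (j) and two age classes (k)
d <- data.frame(
  sex    = factor(c("f", "f", "m", "m", "f", "m")),
  age    = factor(c("young", "old", "young", "old", "young", "old")),
  weight = c(3, 4, 6, 8, 4, 6)
)

# Hypothetical helper: divide a vector by the square root of its sum of squares
unit_norm <- function(v) v / sqrt(sum(v^2))

# One grouping factor (j): normalise weight within each sex
d$norm.sex <- ave(d$weight, d$sex, FUN = unit_norm)

# Two grouping factors (j and k): normalise within each sex-by-age combination
d$norm.sex.age <- ave(d$weight, d$sex, d$age, FUN = unit_norm)

# Within every sex the squared normalised values sum to 1
print(tapply(d$norm.sex^2, d$sex, sum))
```

Note that FUN has to be passed by name to ave(), because every unnamed argument after the first is treated as a grouping variable.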
I needed to do just this, so I wrote a function to do it!
You can get hold of it by running
source("http://db.tt/22hmSliJ")
in R. This provides you with a function called normalise with the following arguments
dataframe – self explanatory
columns – a quoted variable name (e.g. “weight”). It actually only works on a single column at present, so the name is a bit of a misnomer, but it's easy enough to loop it**
by – one or two grouping factors, again quoted and enclosed in c() if there are two
na.rm – logical, remove any NAs? Defaults to TRUE
For example, use
data <- normalise(data, "weight", by="sex")
to normalise weight according to sex, or
data <- normalise(data, "weight", by=c("age", "sex"))
to normalise weight by age and sex.
The function adds a column to the original dataframe with the original name preceded by “norm.”, so in this case it would be “norm.weight”.
Currently it only works if the by argument is a factor, but I shall change that at some point and update this post. It might also change the order of the dataframe, but I don't think that's much of a big deal.
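If the source() link is unavailable, something similar can be put together in base R. The sketch below is hypothetical: it is not the code behind the link, and the name normalise_sketch is made up. It assumes the same four arguments described above, accepts several columns at once, works with grouping variables that are not factors, and leaves the row order untouched.

```r
# Hypothetical sketch, NOT the function behind the source() link.
# Arguments mirror those described in the post.
normalise_sketch <- function(dataframe, columns, by, na.rm = TRUE) {
  # Combine one or more grouping columns into a single factor, so that
  # character or numeric grouping variables work as well as factors
  groups <- interaction(dataframe[by], drop = TRUE)

  for (col in columns) {
    y <- dataframe[[col]]
    # Group-wise sqrt(sum(y^2)), returned in the original row order
    denom <- ave(y, groups, FUN = function(v) sqrt(sum(v^2, na.rm = na.rm)))
    dataframe[[paste0("norm.", col)]] <- y / denom
  }
  dataframe
}

# Made-up data to try it out
d <- data.frame(
  sex    = c("f", "f", "m", "m"),
  weight = c(3, 4, 6, 8),
  height = c(1, 1, 2, 2)
)
d <- normalise_sketch(d, c("weight", "height"), by = "sex")
print(d)
```

Because ave() returns its result aligned to the input rows, there is no merge step and so nothing that could reorder the dataframe.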
Hope it helps!
* e.g. Risch AC, Jurgensen MF, Frank DA (2007) Effects of grazing and soil micro-climate on decomposition rates in a spatio-temporally heterogeneous grassland. Plant and Soil 298:191-201
**
# loop over numeric columns only, grouping by a factor such as sex
for (i in c("height", "weight")) {
  data <- normalise(data, i, by = "sex")
}