Normalising data within groups

[This article was first published on Insights of a PhD student » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Occasionally it proves useful to normalise data. By this I mean to scale it between zero and one. Admittedly, most people frown of this but there are papers out there with this method in use*.

How do we go about this? Its a very simple formula to calculate:

y'[i] = y[i]/sqrt(sum(y^2))

So we square all of the ys, add them up and take the square root (call in the denominator). Then we divide each individual y value by the denominator.

In R this is simple – for instance decostand in the vegan package does exactly this (plus a whole heap of other standardisations).

But what I couldnt find was a function to take it a step further, a function that normalised within groups:

y'[ij] = y[ij]/sqrt(sum(y^2[j]))

The difference here are the js of course. Or to go a step further still:

y'[ijk] = y[ijk]/sqrt(sum(y^2[jk]))

where the ks represent subgroups of j.

I needed to do just this, so I wrote a function to do it!

You can get hold of it by running

source("http://db.tt/22hmSliJ")

in R. This provides you with a function called normalise with the following arguments

dataframe – self explanatory

columns – a quoted variable name (e.g. “weight”) actually only works on a single column currently so this is a bit of a misnomer. But its easy enough to loop it**

by – one or two grouping factors, again quoted and enclosed in c() if there are two

na.rm – logical, remove any NAs? Defaults to TRUE

data <- normalise(data, "weight", by="sex")

to normalise weight according to sex, or

data <- normalise(data, "weight", by=c("age", "sex"))

to normalise weight by age and sex.

The function adds a column to the original dataframe with the original name preceded by “norm.”, so in this case it would be “norm.weight”.

Currently it only works if the by argument is a factor, but I shall change that at some point and update this post. It might also change the order of the dataframe, but thats not so much of a big deal I dont think.

Hope it helps!

* e.g. Risch AC, Jurgensen MF, Frank DA (2007) Effects of grazing and soil micro-climate on decomposition rates in a spatio-temporally heterogeneous grassland. Plant and Soil 298:191-201

**

for(i in c("height", "weight", "eye_colour")){
data <- normalise(data, i, by="weight")
}

To leave a comment for the author, please follow the link and comment on their blog: Insights of a PhD student » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)