Tapply on steroids – by (climate trends in Scottish seasons)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
A while back I wrote a post on using the R tool tapply, which is my most popular post. Tapply allows you to repeat a function over a vector split into groups, e.g. you’ve two columns of data with the first being Date and the second being a variable (snow days, banana imports, whatever) and you want to look at your variable split by month, year, etc.. The examples in this post use Met Office data I’ve previously used for September weather and Scottish winters. You can download these and write them to a database to follow this post using my Bitbucket repository. An example of tapply is to find the mean over each season, i.e.:
# Setup library(RSQLite) # Get all Scottish data from db data = dbGetQuery(clim.db, "SELECT Year, Month AS Season, Variable, Vals FROM data WHERE Country='Scotland' AND (Month='SPR' OR Month='SUM' OR Month='AUT' OR Month='WIN')") # reorder factor data$Season = factor(data$Season, levels=c("SPR", "SUM", "AUT", "WIN")) # relabel factor data$Season = factor(data$Season, labels=c("Spring", "Summer", "Autumn", "Winter")) # reorder factor data$Variable = factor(data$Variable, levels=c("frost", "maxtemp", "meantemp", "mintemp", "precipdays", "precip", "sun")) # relabel factor data$Variable = factor(data$Variable, labels=c("Frost days", "Max temp (degC)", "Mean temp (degC)", "Min temp (degC)", "Precipitation days", "Precipitation (mm)", "Sunshine (hrs)")) # ---------------------------------------------- # ---------------------------------------------- # Find the mean value for each season data.mean = tapply(data$Vals, list(data$Season, data$Variable), mean) # Find the mean value and ignore NAs data.mean = tapply(data$Vals, list(data$Season, data$Variable), mean, na.rm=T)
So now data.mean is a matrix of means with rows as seasons and columns as climate attributes. Great! But not the point of this post, I covered this previously. What tapply can’t do is iterate over more than a vector, i.e. it can only apply your function to a single vector at a time. If you wanted to iterate over two columns finding the correlations between them, it can’t do this. Which brings me neatly onto the function by. Using our previously loaded data, we can now make a Pearson’s correlation between years and variables. This is basically r, of r² fame, where the sign indicates the direction of correlation; i.e. a negative value indicates a negative correlation (y decreases for increasing x) and visa versa. The closer to 1 or -1 this correlation gets the better the line fit to the data points. However this can be misleading, e.g. if you only have two points you will always get a correlation of 1 or -1. Here this is in practice:
# Correlation data.cor = by(data, list(data$Season, data$Variable), function(i){cor(x=i[,1], y=i[,4], method="pearson")}) # So the explaining variable (x) is the first column of data (Year) # and the dependent variable (y) is the fourth column of data (Vals)
Now data.cor is a list (sort of) of correlations for each variable and season. Have a look, they’re pretty terrible. This list is slightly awkward to work with, the following code puts it into a useable format:
# Tidy correlation list data.cor = stack(data.frame(rbind(data.cor))) data.cor = data.frame(Season=sort(unique(data$Season)), Variable=data.cor[,2], Correlation=round(data.cor[,1], 2))
We can also repeat this for linear models/regression lines/best fit:
# Regression line data.lm = by(data, list(data$Season, data$Variable), function(i){lm(i[,1] ~ i[,4])}) # Tidy regression data data.lm = data.frame(Season=sort(unique(data$Season)), Variable=rep(sort(unique(data$Variable)), each=length(data.lm)/length(unique(data$Variable))), round(t(sapply(data.lm, coef)), 1)) names(data.lm)[3:4] = c("Intercept", "Slope")
Excitingly, we can turn these results into text strings for plot annotation:
# Turn correlation and regression into text annotation # \n means a new line text.ann = data.frame(data.lm[,1:2], lab=paste0("Correlation = ", data.cor[,3], "\nSlope = ", data.lm[,4], "\nIntercept = ", data.lm[,3]))
And then create a plot matrix of our data with results annotated on top:
# Plot results library(ggplot2) # Get plotting position for annotation text.ann$pl = rep(tapply(data$Vals, data$Variable, max, na.rm=T), each=4) # Write plot to png png("~/results.png", width=1600, height=2000, res=210) ggplot(data, aes(x=Year, y=Vals)) + geom_point(shape=1) + stat_smooth(method="lm", colour="black", size=1) + facet_grid(Variable ~ Season, scales="free_y") + ylab("Climate variables") + geom_text(data=text.ann[-c(6,10,14,25,26),], aes(label=lab, y=pl*.9), x=1930, size=2) + geom_text(data=text.ann[c(6,10,14,25,26),], aes(label=lab, y=pl*.3), x=1930, size=2) dev.off()
The results of this are the plot below:
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.