R: Basic R Skills – Splitting and Plotting
[This article was first published on compBiomeBlog, and kindly contributed to R-bloggers].
I am giving a short R course next year, so I am going to make a series of blog posts to help get my thoughts and example code in order. The aim is to introduce people with little or no experience of R to the language with self-contained examples. The order of the posts is not going to reflect any order in the course, just what I feel like doing at the time.
This first post is going to deal with splitting and plotting data. It is a common occurrence to have data in such a form that you want to split the data in one column based on the data in another column. Maybe you want to split an experimental result by age or gender for example. Perhaps you want to see if there is a difference in the distribution of results in males and females. The example code below goes through one such hypothetical example.
### Install RColorBrewer if it is missing, then load it
if (!require(RColorBrewer)){
  install.packages("RColorBrewer")
  library(RColorBrewer)
}
### Generate a random data set
data <- data.frame(names=c("Type1","Type2")[as.numeric(runif(n=100)>=0.5)+1],data=rnorm(100,100,sd=25))
### Use the aggregate function to split and get the mean of the data
aggregate(data$data,list(data$names),mean)
### Use the sapply and split functions to do the same thing
s <- split(data$data,list(data$names))
sapply(s,mean)
# Or the same thing in one line
sapply(split(data$data,list(data$names)),mean)
### Below is a function which is just a group of commands used to stop
### you having to type the same code in again for another dataset
plotData <- function(data,cols){
  ### Make sure the grouping column is a factor, so that plot() draws a box plot
  data$names <- factor(data$names)
  ### Draw a box plot of the data
  plot(data$data ~ data$names,col=cols,pch=20)
  ### Run a t-test on the split data
  pval <- t.test(data$data ~ data$names)$p.value
  ### Are they significantly different?
  areSig <- c("Not Significant","Significant")[as.numeric(pval<=0.05)+1]
  ### Calculate the density of the data, after splitting
  dens <- lapply(split(data$data,data$names),density)
  ### Draw an empty figure (type="n") with the correct x and y limits of the data
  plot(1,type="n",xlim=c(0,max(sapply(dens,function(x) max(x$x)))),ylim=c(0,max(sapply(dens,function(x) max(x$y)))),xlab="data",ylab="Density")
  ### Draw the density plots for each data type
  invisible(lapply(seq_along(dens),function(i) lines(dens[[i]],col=cols[i],lwd=3)))
  ### Add a legend, using only as many colours as there are groups
  legend("topleft",legend=names(dens),col=cols[seq_along(dens)],lwd=4)
  ### Add a title with the p-value and whether it is significant or not
  title(paste("P-value=",format.pval(pval),areSig))
}
### Draw figures in a 2 x 2 grid
par(mfrow=c(2,2))
### Run the plotData function on the data object
plotData(data,cols=brewer.pal(8,"Dark2"))
### Make a new version of the data object, which should be significantly different, as they have different means
data <- data.frame(names=rep(c("Type3","Type4"),each=50),data=c(rnorm(50,100,sd=20),rnorm(50,50,sd=10)))
### Plot the new version of the data
plotData(data,cols=brewer.pal(8,"Set1"))
The figure shows the output you should get from running the code. Essentially the example is designed to illustrate the split function and the ~ (tilde) character.
The split function will do what it says: split a vector of data (A) based on another vector (B). It returns a list with one element per unique value in B, and each element holds all of the values in A whose corresponding entry in B has that value. For example:
A <- c(1,2,3,4)
B <- c("X","Y","X","Y")
sp <- split(A,B)
sp
$X
[1] 1 3
$Y
[1] 2 4
Now we have a list, and we can operate on each element of the list using the apply functions, such as lapply.
lapply(sp,sum)
$X
[1] 4
$Y
[1] 6
There are lots of different apply functions; a good introduction is here.
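As a rough sketch of how a few members of the apply family differ, the snippet below reuses the A and B vectors from the split example. The choice of sapply, vapply and tapply is mine, picked only to illustrate the different return types; it is not a complete tour.

```r
# Same toy data as in the split example
A <- c(1, 2, 3, 4)
B <- c("X", "Y", "X", "Y")
sp <- split(A, B)

# lapply always returns a list
res_list <- lapply(sp, sum)

# sapply simplifies the result to a named vector when it can (X = 4, Y = 6)
res_vec <- sapply(sp, sum)

# vapply does the same, but checks every result against a template,
# here "a single number", and stops with an error if one does not match
res_checked <- vapply(sp, sum, FUN.VALUE = numeric(1))

# tapply does the split and the apply in a single call
res_tapply <- tapply(A, B, sum)

print(res_vec)
print(res_tapply)
```

As a rule of thumb, lapply is the safe default when you want a list back, and sapply or vapply are convenient when each group reduces to a single value.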
The other main way of splitting is using the ~ (tilde) operator. In my head I always read this as ‘given’, so plot(A ~ B) is “plot A given B”. This is an example of the formula notation in R, but here we are using it very simply. It essentially does the same thing as split.
Note: You actually need to do plot(A ~ factor(B)) if B isn’t already a factor.
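To make the link between the formula notation and split concrete, here is a small sketch on made-up data. The vectors, the group means and the seed are arbitrary assumptions for illustration. It runs the same t-test both ways and then draws the box plot both ways.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data: 20 values per group, with slightly different means
A <- c(rnorm(20, mean = 10), rnorm(20, mean = 12))
B <- rep(c("X", "Y"), each = 20)

# Formula version: "A given B"
p_formula <- t.test(A ~ B)$p.value

# Split version: pull the two groups apart first, then pass them separately
sp <- split(A, B)
p_split <- t.test(sp$X, sp$Y)$p.value

# Both routes run the same two-sample t-test, so the p-values agree
print(all.equal(p_formula, p_split))

# For plotting, wrap B in factor() so that plot() draws one box per group
plot(A ~ factor(B))

# boxplot() accepts the list returned by split() directly
boxplot(split(A, B))
```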
Lots of functions support the formula notation, such as t.test in the example. For those that do not, you can use the lapply and split version, as is done for density in the example.
I also mention the aggregate function, which essentially is the same as lapply and split but seems slower on large datasets.
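The sketch below compares the two approaches on a small made-up data frame (the object name df and the seed are my own assumptions). The main practical difference is the container you get back: aggregate returns a data frame, while split plus sapply returns a named vector. The system.time calls are only there so you can check the speed claim on your own data; on a data set this small the timings will not be meaningful.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical data in the same shape as the main example
df <- data.frame(
  names = rep(c("Type1", "Type2"), each = 50),
  data  = rnorm(100, mean = 100, sd = 25)
)

# aggregate (here with its formula interface) returns a data frame,
# with one row per group
agg <- aggregate(data ~ names, data = df, FUN = mean)

# split + sapply returns a named numeric vector
sps <- sapply(split(df$data, df$names), mean)

# Same numbers, different containers
print(all.equal(agg$data, unname(sps)))

# Time each approach; swap in a larger data set to see a real difference
print(system.time(aggregate(data ~ names, data = df, FUN = mean)))
print(system.time(sapply(split(df$data, df$names), mean)))
```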