[This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I recently attended a week-long R course in Newcastle, taught by Colin Gillespie. It went from “An Introduction to R” to “Advanced Graphics” via a day each on modelling, efficiency and programming. Suffice to say it was an intense 5 days!
Overall it was the best R course I’ve been on so far. I’d recommend it to others, from advanced users to adventurous novices. Below I explain why, with a brief description of each day and an emphasis on day 2.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Day 1
Day 1 introduced R in the context of alternatives: Python, SAS and the dreaded Microsoft Excel. Amusingly, Excel was often used as a reason of why to use R. I skipped the second half of the day but from what I saw it seemed a fun and practical introduction to an ostensibly dry subject.Day 2
On the second day, we moved firmly into the realm of applications with a focus on statistics. In addition to new insights into R, I also gained new insights into statistical test. I wasn’t expecting this when signing up to an R course but it was well worth it. Colin’s description of the T-test, the related QQ plot and other commonly used tests were some of the clearest I’ve ever heard. Quirks in R were raised, providing insight into its origins, one example being the number of datasets lying around: “An artifact of R being designed by statisticians is that there are loads of random datasets just lying around.”Insights into T tests in R
sleep
is a dataset in base R on the impact of different drugs on sleeping patterns of 10 students, which is actually rather useful for explaining t tests in R.
head(sleep) ## extra group ID ## 1 0.7 1 1 ## 2 -1.6 1 2 ## 3 -0.2 1 3 ## 4 -1.2 1 4 ## 5 -0.1 1 5 ## 6 3.4 1 6A trick I didn’t know was that
tapply
can be used to run a command on sub-groups within a single vector:
tapply(sleep$extra, INDEX = sleep$group, FUN = sd) ## 1 2 ## 1.789 2.002This shows that the standard deviation of extra sleep values are comparable, with the second group’s response showing slightly higher variability. These are the types of test needed to check whether the t test is appropriate. Because a t test is appropriate, we go ahead and run it, resulting in a significant difference between the two drugs:
t.test(extra ~ group, paired = TRUE, data = sleep, ) ## ## Paired t-test ## ## data: extra by group ## t = -4.062, df = 9, p-value = 0.002833 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -2.4599 -0.7001 ## sample estimates: ## mean of the differences ## -1.58
A new insight into lists
The list is a well-known data class in R, used to contain collections of any other objects. I’ve always known that list objects should be referred to using the double square bracket notation – e.g.[[5]]
. But what I never knew, until attending this course, was why. Lets take a look at a simple example:
lst <- list(x = c(1:3), y = "z", z = c(2:4)) lst[[1]] ## [1] 1 2 3 lst[1] ## $x ## [1] 1 2 3We can see the second output is printed differently: the only difference between the single and double notation is the output type. Double square brackets output the object’s inate data type; single square brackets always output another list. This explains the error in the following code:
lst[1:2] ## $x ## [1] 1 2 3 ## ## $y ## [1] "z" lst[[1:2]] ## [1] 2 lst[1:3] ## $x ## [1] 1 2 3 ## ## $y ## [1] "z" ## ## $z ## [1] 2 3 4 lst[[1:3]] ## Error: recursive indexing failed at level 2In the second line the output seems unintuitive. This is how it works:
1:2
is the same as c(1, 2)
, a vector of length 2 and each number refers to a different ‘level’ in the list. Therefore [[c(2, 2)]]
fails where [c language="(2,2)"][/c]
succeeds:
lst[[c(2, 2)]] ## Error: subscript out of bounds lst[c(2, 2)] ## $y ## [1] "z" ## ## $y ## [1] "z"
Adding interaction in ANOVA models
Another interesting little tidbit of knowledge I learned on the course was that the:
symbol can be used in R formulae to represent interaction between two independent variables. Thus, y ~ a + b + a:b
includes the influence of a
and b
on y
, as well as the impact of their interaction. However, R users usually save time by writing this simply as y ~ a * b
.
To test this example was provided – the latter two tests generate the same result (not shown):
m1 <- aov(weightgain ~ source, data = ratfeed) summary(m1) m2 <- aov(weightgain ~ source * type, data = ratfeed) summary(m2) m3 <- aov(weightgain ~ source * type, data = ratfeed) summary(m3)
No-nonsense introduction to clustering
Clustering is something that always seemed rather mysterious to me. The criteria by which observations are deemed ‘similar’ appear arbitrary. Indeed, Colin stated that ‘clustering never proves anything’ and that there are always infinite ways to cluster observations. The no-nonsense description was a breath of fresh air compared with other descriptions of clustering which hide the process behind a wall of jargon. The simplest way to cluster data is via ‘Euclidean distance’ between each combination of observations. The distance between observation 1 and 2, for example, is simply, the sum of squared differences between each variable:(x11−x21)2+(x12−x22)2+...+(xn2−xn2)2
Day 3
Day 3 provided a solid introduction to programming in R with attention paid toif
statements, function creations, the bizarre yet useful apply
family and recommended R workflows.
It was great to have a solid description of if statements in R, which I’d used tentatively before but never felt 100% sure using. The following example was used to demonstrate multiple ifs and the final else that completes such structures:
x <- 5 if(x > 10){ message("x is greater than 10") } else if(x > 5){ message("x is greater than 5") } else if(x > 0){ message("x is greater than 0") } else if(x > -5){ message("x is greater than -5") } else { message("x is less than -5") } ## x is greater than 0Also, Colin’s description of R workflows was very useful: I already used a variant of it, but the description of how to structure work reaffirmed the importance of order for me and will be of use next time I teach R. This is a variant of the file structure Colin recommends, that I’m implementing on an R for spatial microsimulation book project I’m working on:
|-- book.Rmd |-- data |-- figures |-- output |-- R | |-- load.R | `-- parallel-ipfp.R `-- spatial-microsim-book.Rproj
Day 4
On Day 4 we were guided through benchmarking in R and its importance for developing fast code. I learned about the parallelpackage which is hidden away in base R (its functions are only activated withlibrary(parallel)
— try it!). parallel provides parallel versions of some of the apply
functions, such as parApply
. This is really useful if you have speed critical applications running on multi-core devices.
Parallelisation might get you a speed boost of a factor of 2 to 4, but dumb code can hit you with a 100 fold speed penalty.Therefore the priority should be to check your code for bottlenecks before even considering parallelisation.
A good example is avoiding object recreation/expansion during for loops: create the vector size you want outside the loop. This explains why m1
is so much slower than the rest. We also looked at vectorisation: m3
illustrates that vectorised code is often faster than for
loop alternatives:
# method 1 m1 <- function(n){ mvec = NULL for(i in 1:n){ mvec = c(mvec, i) } } # method 2 m2 <- function(n){ mvec = numeric(n) for(i in 1:n){ mvec[i] <- i } } # method 3 m3 <- function(n){1:n} library(microbenchmark) n = 10000 mb <- microbenchmark(m1(n), m2(n), m3(n), times = 2L) print(mb) ## Unit: microseconds ## expr min lq median uq max neval ## m1(n) 161579.59 161579.59 167862.32 174145.05 174145.05 2 ## m2(n) 8500.44 8500.44 9222.45 9944.45 9944.45 2 ## m3(n) 19.66 19.66 21.49 23.33 23.33 2
Day 5
The was the reward for all our hard work on the other days: making pretty pictures. We looked into the internals of ggplot2and found a powerhouse graphics package inside. A guy sitting next to me used the insight gleaned from this final day to re-draw a plot he was presenting the subsequent week, so it was certainly useful!Conclusion
Overall it was well worth the time and money and I look forward to attending the more advanced course on “Advanced Programming in R” and package creation. The course materials were excellent: clear, concise and entertaining at times (the first R tutorials I’ve seen plastered with cartoons!). The practicals were based on a series of R packages that can be installed for free. Simply typing the following will give you an insight into the teaching on offer, and allow you to play Monopoly in R!install.packages("nclRintroduction", repos = "http://r-forge.r-project.org") install.packages("nclRmodelling", repos = "http://r-forge.r-project.org") install.packages("nclRprogramming", repos = "http://r-forge.r-project.org/") install.packages("nclRefficient", repos = "http://r-forge.r-project.org") library(nclRprogramming) vignette(package = "nclRprogramming") vignette(topic = "practical2b", package = "nclRprogramming")If you struggle with any of the questions, maybe you need to attend an R course or read a book to brush up! Stackoverflow is great but for some things you cannot beat face to face interaction or the depth of a book.
To leave a comment for the author, please follow the link and comment on their blog: Robin Lovelace - R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.