Get Better: R for cell biologists
How can we teach “R for cell biologists” rather than teaching R to cell biologists?
I’ve noticed that many R training courses will teach R – regardless of who is taking the course – and leave it to the participants to figure out how they can use R in their own discipline. Often, folks from my lab will take an R course and spend a half-day making some plots of the iris dataset or calculating miles per gallon using mtcars. Once they leave the room, it is hard for them to connect what they learned to a real-life use case in the lab. This is not a criticism of those courses; I have run a workshop like that myself. The question is: how can we empower cell biologists (or other wet lab scientists) to use R for their work?
This is my first attempt at solving this problem. The materials and brief descriptions are in this GitHub repo. Below is a full description. Skip to here for thoughts on how it went.
“R for Cell Biologists” workshop
Introduction
Three concepts need to be introduced:
1. why we use R rather than Microsoft Excel
Emphasise reproducibility, automation, and publication-quality graphics.
2. the pathway from experiment to figure
A typical experiment involves setting up cells, imaging them on the microscope, and analysing the images in Fiji; the output of the analysis is plain text files, one per image.
To make a figure, we need to process all of these text files and turn them into publication-quality figures, using R.
3. the steps in R are always the same
We need to:
- Read in the data
- Do some calculations or processing (optional)
- Make some plots
Additionally…
Depending on the experience of the group, other concepts may need introducing:
- RStudio as an IDE (what the different panes are for)
- R as a language
- scripting basics
- 1-based vs 0-based languages
- base R, tidyverse, packages
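If the group includes anyone with Python experience, a quick illustration of 1-based indexing can head off confusion later. A minimal sketch (not part of the course script):

```r
# R vectors are indexed from 1, not 0
v <- c("a", "b", "c")
v[1]   # the first element is v[1], not v[0]
# negative indices drop elements (they do not count from the end as in Python)
v[-1]
```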
The hands-on part
Before we tackle steps 1-3 of an R analysis, we have a step 0 which is to set up an R project to work reproducibly.
0. R Project setup
- Start a new project in RStudio: File > New Project… then New Directory, name the project (suggestion: training_yymmdd) and save somewhere on your computer.
- Run the script 00_r_project_setup.R or paste this gist into the console and press Enter.
- Using the course materials, move or copy the scripts into Script/ and the data files into Data/.
Key concept: a standardised directory structure within the R Project folder helps us to easily process data and save the outputs to a standardised place.
Key concept: We always use the R Project folder as our working directory.
It makes the project portable and doesn’t rely on paths to folders on a specific computer.
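The gist itself isn’t reproduced here, but a setup script along these lines creates the standard structure relative to the project root (the exact folder names are an assumption, based on the Script/, Data/ and Output/ paths used later in this post):

```r
# create a standard directory structure inside the R Project folder
# folder names assumed from the paths used elsewhere in this post
dirs <- c("Script", "Data", "Output/Data", "Output/Plots")
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
# all paths in the scripts are then relative to the project root,
# e.g. read.csv("Data/control_n1_1.csv") or ggsave("Output/Plots/output.png")
```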
Challenges to success: the aim is to get participants making a plot of the data promptly. If the setup and data import steps take too long, participants can get lost, so the clock is ticking…
For steps 1-3, execute the script 02_training.R line by line (Cmd + Enter on Mac; Ctrl + Enter on Windows/Linux), explaining what each line does as you go.
Check for understanding throughout.
1. Read in the data
Goal: make one data frame containing all the data
- Begin by reading one file into a data frame, explaining that we have 80 files to read in.
- Show how we can read all of them into one huge data frame using a simple command
- But how do we know which rows belong to what condition and/or which experimental repeat?
- Use the filename to append information as it is read in.
```r
## 1. Load data ----
# read one csv file into R as an object called temp
temp <- read.csv("Data/control_n1_1.csv")
# have a look at it
View(temp)
# remove the object
rm(temp)
# we want to load all files, so we need to get a list of all files
list.files("Data")
# assign this list as an object
filelist <- list.files("Data")
# use the path to the file
filelist <- list.files("Data", full.names = TRUE)
# load all files and rbind into big dataframe
df <- do.call(rbind, lapply(filelist, read.csv))
View(df)
# load all files and rbind into big dataframe, add a column to each file to identify each file
df <- do.call(rbind, lapply(filelist, function(x) {temp <- read.csv(x); temp$file <- x; temp}))
# this can be written on multiple lines
df <- do.call(rbind, lapply(filelist, function(x) {
  temp <- read.csv(x)
  temp$file <- x
  temp
}))
# this is close to what we want but remember that we used full paths to load the files,
# we only want the file name
# we can use the basename function to extract the file name
df <- do.call(rbind, lapply(filelist, function(x) {
  temp <- read.csv(x)
  temp$file <- basename(x)
  temp
}))
# file column has name of the file, name is of the form foo_bar_1.csv,
# extract foo and bar into two columns
df$cond <- sapply(strsplit(df$file, "_"), "[", 1)
df$expt <- sapply(strsplit(df$file, "_"), "[", 2)
# explain why the above works
# strsplit(df$file, "_") returns a list of vectors,
# each vector is the result of splitting the string by "_"
# we can extract the first element of each vector using "[", 1
# we can extract the second element of each vector using "[", 2
# sapply applies the function to each element of the list
# what does "[" do? it extracts elements from a vector
# what does "[[" do? it extracts elements from a list
```
A whiteboard illustration can be used to show participants how importing 1 or 80 files works. It can be helpful to visualise how we add extra columns to the data frame to identify them.
Key concept: think about how you’ll name the outputs of your analysis in Fiji to make reading the data into R as easy as possible.
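As a quick illustration of why consistent naming pays off (the file names here are invented for the example), names of the form cond_expt_cell.csv split cleanly into metadata:

```r
# consistent file names split cleanly into condition / repeat / cell
good <- c("control_n1_1.csv", "rapamycin_n2_7.csv")
do.call(rbind, strsplit(good, "_"))
# each row gives: condition, experimental repeat, cell number (+ extension)
```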
An alternative workflow that we use in the lab is to use nested folders for conditions and experimental repeats, rather than a flat structure as in this example.
In that case, folder names are used to append information to the data frame.
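A sketch of how the nested-folder variant can work, using list.files(recursive = TRUE). The toy files and folder names below are invented for illustration; the real course data uses the flat layout:

```r
# toy example of a nested layout: Data/<condition>/<repeat>/<file>.csv
dir.create("Data/control/n1", recursive = TRUE, showWarnings = FALSE)
dir.create("Data/rapamycin/n1", recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(Mean = c(10, 12)), "Data/control/n1/cell1.csv", row.names = FALSE)
write.csv(data.frame(Mean = c(20, 22)), "Data/rapamycin/n1/cell1.csv", row.names = FALSE)

# recursive = TRUE finds files in all subfolders
filelist <- list.files("Data", full.names = TRUE, recursive = TRUE)
df <- do.call(rbind, lapply(filelist, function(x) {
  temp <- read.csv(x)
  # folder names, not the file name, label each row
  temp$cond <- basename(dirname(dirname(x)))  # condition folder
  temp$expt <- basename(dirname(x))           # repeat folder
  temp
}))
```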
Challenges to success: we use a very simple “one-liner” in base R to load in the data. Unfortunately, it is complicated to explain how this works. DO NOT get caught up in explaining alternatives. Of course there are many ways to achieve this part. The workshop does not use for-loops and the participants do not need to understand them to get to their goal. Remember, they are not here to learn R programming per se.
2. Do some calculations
This is an optional step.
Some examples are shown but they are not needed for this exercise, as we will simply plot the data.
```r
## 2. Calculations ----
# normalise the data in the Mean column - just as an example
df$MeanNorm <- df$Mean / max(df$Mean)
# standardise the data in the Mean column - just as an example
df$MeanStand <- (df$Mean - mean(df$Mean)) / sd(df$Mean)
# since we have messed things up a bit, we can remake the df easily
# run the top part again to remake the df and then save the output
filelist <- list.files("Data", full.names = TRUE)
df <- do.call(rbind, lapply(filelist, function(x) {
  temp <- read.csv(x)
  temp$file <- basename(x)
  temp
}))
df$cond <- sapply(strsplit(df$file, "_"), "[", 1)
df$expt <- sapply(strsplit(df$file, "_"), "[", 2)
# write data to file
write.csv(df, "Output/Data/df.csv")
```
Key concept: use this part to re-emphasise the power of scripting by showing how we can recreate the dataframe easily in just a few lines of code.
3. Make some plots
- Use ggplot to make some plots
- Explain grammar of graphics and demonstrate the power of facetting, theming and so on
- Explore the data, notice that one experiment is different to the others
- Make a SuperPlot
- Save the SuperPlot
```r
## 3. Visualisation ----
# we will use ggplot2 for visualisation
# note that library loading usually goes at the top of the script!
library(ggplot2)
# histogram to look at the data
ggplot(df, aes(x = Mean, fill = cond)) +
  geom_histogram()
# density plot to look at the data
ggplot(df, aes(x = Mean, fill = cond)) +
  geom_density(alpha = 0.5)
# facetting, make plots by expt
ggplot(df, aes(x = Mean, fill = cond)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~expt)
# themes
ggplot(df, aes(x = Mean, fill = cond)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~expt) +
  theme_minimal()
ggplot(df, aes(x = Mean, fill = cond)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~expt) +
  theme_bw()
# superplot
ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_jitter()
library(ggforce)
ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_sina()
ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_sina(position = "auto")
# calculate experiment means
summary_df <- aggregate(Mean ~ cond + expt, data = df, FUN = mean)
# add to plot
ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_sina(position = "auto", alpha = 0.5) +
  geom_point(data = summary_df, aes(x = cond, y = Mean, colour = expt), shape = 15, size = 3)
# keep adding to this or
p <- ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_sina(position = "auto", alpha = 0.5) +
  geom_point(data = summary_df, aes(x = cond, y = Mean, colour = expt), shape = 15, size = 3)
p + theme_minimal()
p + theme_bw()
p + theme_bw() +
  scale_color_manual(values = c("#4477aa", "#ccbb44", "#ee6677", "#228833"))
# save plot
ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_sina(position = "auto", alpha = 0.5, maxwidth = 0.2) +
  geom_point(data = summary_df, aes(x = cond, y = Mean, colour = expt), shape = 15, size = 3) +
  scale_color_manual(values = c("#4477aa", "#ccbb44", "#ee6677", "#228833")) +
  labs(x = "", y = "Mitochondria Fluorescence (A.U.)") +
  theme_bw(10) +
  theme(legend.position = "none")
ggsave("Output/Plots/output.png", width = 6, height = 4, dpi = 300) # this is an easy way
# stats
# t-test
t.test(summary_df$Mean ~ summary_df$cond)
```
In this part you will use two libraries: {ggplot2} and {ggforce}. Participants will see some plots for the first time, so use this as a chance to talk about the data in cell biological terms. The dataset is designed to contain an irregularity (see below), so you can use this as a chance to get the participants to flex their cell biological knowledge of why that may be!
Key concept: the data frame you made has all the information to make any plot you’d need.
Homework
To consolidate the learning, ask the participants to figure out how to do the following:
Which row (from which experiment/condition/cell) had the lowest Mean value?
Next, explain that the person who did the experiments found out that the rapamycin used in the 4th experiment was prepared from a stock solution which was at the wrong concentration.
How can we exclude the n4 data and remake a new SuperPlot so that all experiments used the correct concentration?
Explain that getting assistance from an LLM is unlikely to help them learn. Searching for a solution is fine. Let them know a keyword to look up: subset/subsetting.
Solutions can be discussed in a follow-up session or via Slack, etc.
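For the instructor, a solution sketch along the lines below may be handy. The toy data frame stands in for the real df built in 02_training.R; after subsetting, rerun the SuperPlot code on df_sub:

```r
# toy stand-in for the workshop data frame built in 02_training.R
df <- data.frame(
  Mean = c(10, 12, 9, 30),
  cond = c("control", "control", "rapamycin", "rapamycin"),
  expt = c("n1", "n4", "n1", "n4")
)

# Q1: which row (experiment/condition/cell) had the lowest Mean value?
df[which.min(df$Mean), ]

# Q2: exclude the n4 data; rerun the SuperPlot code using df_sub in place of df
df_sub <- subset(df, expt != "n4")
```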
How did it go?
I’ve run this training workshop twice and the feedback has been good both times. It has definitely led some lab members, who previously avoided R, to start using it for their own data analysis. In fact, they use the script from this session as the basis for their own. This has had the unintended consequence of standardising lab scripts, which in turn makes them easier to troubleshoot.
The first time we ran the training, we also paired it with a Fiji workshop so that participants could think about the entire workflow. For this we used an image analysis problem that needed solving. This worked quite well, but since the solution was unknown, it was hard to tie it to the R workshop.
I emphasise that when thinking about the experiment-to-figure workflow, lab members should consider their analysis and how to make it easier for themselves. For example, naming files at the microscope is important because the names get carried through to the Fiji step and then into R; any inconsistencies cause headaches. Again, if this improves lab members’ work habits, that’s only a good thing.
The materials can probably be refined. As they are, they take 90 minutes to cover, and if there are questions, it can get quite tight. Extending the middle section with an example that requires more calculations would be useful to really show off what R can do, but this would obviously require more time.
—
The post title comes from Get Better by The New Fast Automatic Daffodils. The version I have is on a compilation of Martin Hannett produced tracks called “And Here Is The Young Man”.
Part of a series on development of lab members’ skills.