Get Better: R for cell biologists

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers.]

How can we teach “R for cell biologists” rather than teaching R to cell biologists?

I’ve noticed that many R training courses will teach R – regardless of who is taking the course – and leave it to the participants to figure out how they can use R in their own discipline. Often, folks from my lab will take an R course and spend a half-day making some plots of the iris dataset or calculating miles per gallon using mtcars. Once they leave the room, it is hard for them to connect what they learned to a real-life use case in the lab. This is not a criticism of those courses – I have run a workshop like that myself – but the question is: how can we empower cell biologists (or other wet lab scientists) to use R for their work?

This is my first attempt at solving this problem. The materials and brief descriptions are in this GitHub repo. Below is a full description. Skip to here for thoughts on how it went.

“R for Cell Biologists” workshop

Introduction

Three concepts need to be introduced:

1. why we use R rather than Microsoft Excel

Emphasise reproducibility, automation, and publication-quality graphics.

2. the pathway from experiment to figure

A typical experiment involves setting up cells, imaging them on the microscope, analysing the images in Fiji, and the output of the analysis is plain text files, one for each image.
To make a figure, we need to process all of these text files and turn them into publication-quality figures, using R.

3. the steps in R are always the same

We need to:

  1. Read in the data
  2. Do some calculations or processing (optional)
  3. Make some plots
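In miniature, the three steps look like this (the file name, the Mean column, and the values are placeholders, not the workshop data; the demo file is written first just to make the snippet self-contained):

```r
# a tiny stand-in for one Fiji output file (hypothetical values)
write.csv(data.frame(Mean = c(10, 20, 30)), "example.csv", row.names = FALSE)

# 1. read in the data
df <- read.csv("example.csv")

# 2. do some calculations or processing (optional), e.g. normalise
df$MeanNorm <- df$Mean / max(df$Mean)

# 3. make some plots
hist(df$Mean)
```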

Additionally…

Depending on the experience of the group, other concepts may need introducing:

  • RStudio as an IDE (what the different panes are for)
  • R as a language
  • scripting basics
  • 1-based vs 0-based indexing (R counts from 1)
  • base R, tidyverse, packages
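The indexing point is worth a 30-second demo, because it trips up anyone coming from a 0-based language:

```r
v <- c("a", "b", "c")
v[1]           # "a" - R vectors start at element 1, not 0
v[0]           # index 0 returns an empty vector, not an error
length(v[0])   # 0
```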

The hands-on part

Before we tackle steps 1-3 of an R analysis, we have a step 0: setting up an R Project so that we work reproducibly.

0. R Project setup

  • Start a new project in RStudio: File > New Project… then New Directory, name the project (suggestion: training_yymmdd) and save somewhere on your computer.
  • Run the script 00_r_project_setup.R or paste this gist into the console and press enter.
  • Using the course materials, move or copy the scripts into Script/ and the data files into Data/

Key concept: a standardised directory structure within the R Project folder helps us to easily process data and save the outputs to a standardised place.
Key concept: we always use the R Project folder as our working directory. This makes the project portable and avoids relying on paths to folders on a specific computer.

Challenges to success: the aim is to get participants making a plot of the data promptly. If the setup and data import steps take too long, participants can get lost, so the clock is ticking…

For steps 1-3, execute the script 02_training.R line by line (Cmd + Enter on Mac; Ctrl + Enter on Windows/Linux), explaining what each line does as you go.
Check for understanding throughout.

1. Read in the data

Goal: make one data frame containing all the data

  • Begin by reading one file into a data frame; explain that we have 80 files to read in.
  • Show how we can read all of them into one huge data frame using a simple command
  • But how do we know which rows belong to what condition and/or which experimental repeat?
  • Use the filename to append information as it is read in.
## 1. Load data ----
# read one csv file into R as an object called temp
temp <- read.csv("Data/control_n1_1.csv")
# have a look at it
View(temp)
# remove the object
rm(temp)

# we want to load all files, so we need to get a list of all files
list.files("Data")
# assign this list as an object
filelist <- list.files("Data")
# use the path to the file
filelist <- list.files("Data", full.names = TRUE)

# load all files and rbind into big dataframe
df <- do.call(rbind, lapply(filelist, read.csv))
View(df)
# load all files and rbind into big dataframe, add a column to each file to identify each file
df <- do.call(rbind, lapply(filelist, function(x) {temp <- read.csv(x); temp$file <- x; temp}))
# this can be written on multiple lines
df <- do.call(rbind, lapply(filelist, function(x) {
  temp <- read.csv(x)
  temp$file <- x
  temp
}))
# this is close to what we want but remember that we used full paths to load the files, we only want the file name
# we can use the basename function to extract the file name
df <- do.call(rbind, lapply(filelist, function(x) {
  temp <- read.csv(x)
  temp$file <- basename(x)
  temp
}))
# file column has name of the file, name is of the form foo_bar_1.csv, extract foo and bar into two columns
df$cond <- sapply(strsplit(df$file, "_"), "[", 1)
df$expt <- sapply(strsplit(df$file, "_"), "[", 2)
# explain why the above works
# strsplit(df$file, "_") returns a list of vectors, each vector is the result of splitting the string by "_"
# we can extract the first element of each vector using "[", 1
# we can extract the second element of each vector using "[", 2
# sapply applies the function to each element of the list
# what does "[" do? it extracts elements from a vector
# what does "[[" do? it extracts elements from a list

A whiteboard illustration can be used to show participants how importing 1 or 80 files works. It can be helpful to visualise how we add extra columns to the data frame to identify them.

Key concept: think about how you’ll name the outputs of your analysis in Fiji to make reading the data into R as easy as possible.

An alternative workflow that we use in the lab is to use nested folders for conditions and experimental repeats, rather than a flat structure as in this example.
In that case, folder names are used to append information to the data frame.
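A minimal sketch of that nested variant, assuming a hypothetical layout like Data/control/n1/1.csv (the demo folders and values here are made up, created only so the snippet is self-contained):

```r
# make a small demo layout (hypothetical): Data/<cond>/<expt>/<image>.csv
dir.create("Data/control/n1", recursive = TRUE, showWarnings = FALSE)
dir.create("Data/rapamycin/n1", recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(Mean = c(10, 20)), "Data/control/n1/1.csv", row.names = FALSE)
write.csv(data.frame(Mean = c(30, 40)), "Data/rapamycin/n1/1.csv", row.names = FALSE)

# list files recursively so files inside the nested folders are found
filelist <- list.files("Data", full.names = TRUE, recursive = TRUE)

# use the folder names in the path to label each row
df <- do.call(rbind, lapply(filelist, function(x) {
  temp <- read.csv(x)
  parts <- strsplit(x, "/")[[1]]  # e.g. "Data" "control" "n1" "1.csv"
  temp$cond <- parts[2]
  temp$expt <- parts[3]
  temp
}))
```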

Challenges to success: we use a very simple “one-liner” in base R to load in the data. Unfortunately, it is complicated to explain how this works. DO NOT get caught up in explaining alternatives. Of course there are many ways to achieve this part. The workshop does not use for-loops and the participants do not need to understand them to get to their goal. Remember, they are not here to learn R programming per se.

2. Do some calculations

This is an optional step.
Some examples are shown but they are not needed for this exercise, as we will simply plot the data.

## 2. Calculations ----
# normalise the data in the Mean column - just as an example
df$MeanNorm <- df$Mean / max(df$Mean)
# standardise the data in the Mean column - just as an example
df$MeanStand <- (df$Mean - mean(df$Mean)) / sd(df$Mean)

# since we have messed things up a bit, we can remake the df easily
# run the top part again to remake the df and then save the output
filelist <- list.files("Data", full.names = TRUE)
df <- do.call(rbind, lapply(filelist, function(x) {
  temp <- read.csv(x)
  temp$file <- basename(x)
  temp
}))
df$cond <- sapply(strsplit(df$file, "_"), "[", 1)
df$expt <- sapply(strsplit(df$file, "_"), "[", 2)
# write data to file
write.csv(df, "Output/Data/df.csv", row.names = FALSE)

Key concept: use this part to re-emphasise the power of scripting by showing how we can recreate the dataframe easily in just a few lines of code.

3. Make some plots

  • Use ggplot to make some plots
  • Explain grammar of graphics and demonstrate the power of facetting, theming and so on
  • Explore the data, notice that one experiment is different to the others
  • Make a SuperPlot
  • Save the SuperPlot
## 3. Visualisation ----
# we will use ggplot2 for visualisation
# note that library loading usually goes at the top of the script!
library(ggplot2)

# histogram to look at the data
ggplot(df, aes(x = Mean, fill = cond)) + geom_histogram()
# density plot to look at the data
ggplot(df, aes(x = Mean, fill = cond)) + geom_density(alpha = 0.5)
# facetting, make plots by expt
ggplot(df, aes(x = Mean, fill = cond)) + geom_density(alpha = 0.5) + facet_wrap(~expt)
# themes
ggplot(df, aes(x = Mean, fill = cond)) + geom_density(alpha = 0.5) + facet_wrap(~expt) + theme_minimal()
ggplot(df, aes(x = Mean, fill = cond)) + geom_density(alpha = 0.5) + facet_wrap(~expt) + theme_bw()

# superplot
ggplot(df, aes(x = cond, y = Mean, colour = expt)) + geom_jitter()
library(ggforce)
ggplot(df, aes(x = cond, y = Mean, colour = expt)) + geom_sina()
ggplot(df, aes(x = cond, y = Mean, colour = expt)) + geom_sina(position = "auto")
# calculate experiment means
summary_df <- aggregate(Mean ~ cond + expt, data = df, FUN = mean)
# add to plot
ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_sina(position = "auto", alpha = 0.5) +
  geom_point(data = summary_df, aes(x = cond, y = Mean, colour = expt), shape = 15, size = 3)
# keep adding to this or
p <- ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_sina(position = "auto", alpha = 0.5) +
  geom_point(data = summary_df, aes(x = cond, y = Mean, colour = expt), shape = 15, size = 3)
p + theme_minimal()
p + theme_bw()
p + theme_bw() +
  scale_color_manual(values = c("#4477aa", "#ccbb44", "#ee6677","#228833"))

# save plot
ggplot(df, aes(x = cond, y = Mean, colour = expt)) +
  geom_sina(position = "auto", alpha = 0.5, maxwidth = 0.2) +
  geom_point(data = summary_df, aes(x = cond, y = Mean, colour = expt), shape = 15, size = 3) +
  scale_color_manual(values = c("#4477aa", "#ccbb44", "#ee6677","#228833")) +
  labs(x = "", y = "Mitochondria Fluorescence (A.U.)") +
  theme_bw(10) +
  theme(legend.position = "none")
ggsave("Output/Plots/output.png", width = 6, height = 4, dpi = 300)
# ggsave() saves the last plot displayed - an easy way to export figures

# stats
# t-test
t.test(summary_df$Mean ~ summary_df$cond)

In this part you will use two libraries, {ggplot2} and {ggforce}. Participants will see the plots for the first time; use this as a chance to talk about the data in cell biological terms. The dataset is designed to have an irregularity (see below), which is an opportunity for participants to flex their cell biological knowledge of why that may be!

Key concept: the data frame you made has all the information to make any plot you’d need.

Homework

To consolidate the learning, ask the participants to figure out how to do the following:

Which row (from which experiment/condition/cell) had the lowest Mean value?

Next, explain that the person who did the experiments found out that the rapamycin used in the 4th experiment was prepared from a stock solution that was at the wrong concentration.

How can we exclude the n4 data and remake a new SuperPlot so that all experiments used the correct concentration?

Explain that getting assistance from an LLM is unlikely to help them learn. Searching for a solution is fine. Let them know a keyword to look up: subset/subsetting.

Solutions can be discussed in a follow-up session or via slack etc.
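For that follow-up, a possible solution sketch (using a tiny made-up stand-in for the workshop data frame, with hypothetical values):

```r
# a small stand-in for the workshop data frame (values are made up)
df <- data.frame(
  Mean = c(10, 5, 20, 99),
  cond = c("control", "control", "rapamycin", "rapamycin"),
  expt = c("n1", "n4", "n1", "n4")
)

# which row (experiment/condition/cell) had the lowest Mean value?
df[which.min(df$Mean), ]

# exclude the n4 data before remaking the SuperPlot
df_sub <- subset(df, expt != "n4")
```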

How did it go?

I’ve run this training workshop twice and the feedback has been good both times. It has definitely encouraged some lab members, who previously avoided R, to start using it for their own data analysis. In fact, they use the script for this session as the basis for their own. This has had the unintended consequence of standardising lab scripts, which in turn makes them easier to troubleshoot.

The first time we ran the training, we also paired it with a Fiji workshop so that participants could think about the entire workflow. For this we used an image analysis problem that needed solving. This worked quite well, but since the solution was unknown, it was hard to tie it to the R workshop.

I emphasise that when thinking about the experiment-to-figure workflow, lab members should consider their analysis and how to make it easier for themselves. For example, when they are at the microscope, naming files is important because the names get carried through to the Fiji step and then into R. Any inconsistencies cause headaches. Again, if this improves lab members’ work habits, that’s only a good thing.

The materials can probably be refined. As they are, they take 90 min to cover, and if there are questions, it can get quite tight. Extending the middle section with an example that requires more calculations would be useful to really show off what R can do, but this would obviously require more time.

The post title comes from Get Better by The New Fast Automatic Daffodils. The version I have is on a compilation of Martin Hannett produced tracks called “And Here Is The Young Man”.

Part of a series on development of lab members’ skills.
