Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Background
Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients or we might want to see if a particular analyte is stable over time in aliqioted samples. In these experiments are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, the factor that is differing between the ('paired') groups is the container: serum separator tubes (SST), EDTA plasma tubes, plasma separator tubes (PST) etc. In a stability experiment, the factor that is differing is storage duration.
Since this is a fairly common clinical lab experiment, I thought I would just jot down how this is accomplished in R – though I must confess I know just about \(\lim_{x\to0}x\) about statistics. In any case, the statistical test is a repeated-measures ANOVA and this is one way to do it (there are many) including an approach to the post-hoc testing.
Some Fake Data to Work With
I'm going to make some fake data. I tried to dig up the data from an experiment I did as a resident but alas, I think the raw data died on an old laptop. But fake data will do for demonstration purposes. Let's suppose we are looking at parathyroid hormone (PTH) in three different vacutainer tubes: SST, EDTA and PST. For the sake of argument, let's say that we collect samples from 20 patients simultaneously and we anlayze them all as per our usual process. This means that each patient has three samples of material that should be otherwise identical outside of the effects of the collection contained.
library(magrittr) set.seed(100) #to force the same pseudo-random each time #data in pmol/L #induce some heteroscedastic error SST <- runif(20,3,50) PST <- 1.03*SST + rnorm(20,0,0.1)*SST #set the data up to show no difference EDTA <- 1.15*SST + rnorm(20,0,0.1)*SST #set the data up to show a difference tube.data <- data.frame(SST,PST,EDTA) %>% round(.,1) tube.data <- data.frame(Subject = factor(1:20), tube.data)
This is the way we usually express (and receive) data like this in an Excel spreadsheet:
Subject | SST | PST | EDTA |
---|---|---|---|
1 | 17.5 | 18.1 | 19.9 |
2 | 15.1 | 15.7 | 20.0 |
3 | 29.0 | 29.2 | 32.9 |
4 | 5.7 | 6.2 | 6.4 |
5 | 25.0 | 26.1 | 27.0 |
6 | 25.7 | 26.4 | 29.0 |
7 | 41.2 | 40.8 | 48.1 |
8 | 20.4 | 22.1 | 24.3 |
9 | 28.7 | 26.9 | 36.0 |
10 | 11.0 | 13.9 | 13.7 |
11 | 32.4 | 31.9 | 36.9 |
12 | 44.5 | 49.2 | 57.4 |
13 | 16.2 | 17.1 | 15.7 |
14 | 21.7 | 24.1 | 26.3 |
15 | 38.8 | 36.8 | 42.6 |
16 | 34.4 | 34.0 | 44.2 |
17 | 12.6 | 12.1 | 14.1 |
18 | 19.8 | 20.9 | 25.4 |
19 | 19.9 | 18.2 | 23.0 |
20 | 35.4 | 37.4 | 34.1 |
This Excel-ish way of storing the data is referred to as the “datawide” format for obvious reasons.
Gather the Grain
As it turns out this is not the way that we want to store data to do the statistical analyses of interest. What we want to do is have the tube type in a single column because this is the factor that is different within the subjects. We want to gather()
or melt()
the data (depending on your package of choice) to be like so:
library(tidyr) tube.data.2 <- gather(tube.data, key = "Subject") tube.data.2 %>% kable()
Subject | Subject | value |
---|---|---|
1 | SST | 17.5 |
2 | SST | 15.1 |
3 | SST | 29.0 |
4 | SST | 5.7 |
5 | SST | 25.0 |
6 | SST | 25.7 |
7 | SST | 41.2 |
8 | SST | 20.4 |
9 | SST | 28.7 |
10 | SST | 11.0 |
11 | SST | 32.4 |
12 | SST | 44.5 |
13 | SST | 16.2 |
14 | SST | 21.7 |
15 | SST | 38.8 |
16 | SST | 34.4 |
17 | SST | 12.6 |
18 | SST | 19.8 |
19 | SST | 19.9 |
20 | SST | 35.4 |
1 | PST | 18.1 |
2 | PST | 15.7 |
3 | PST | 29.2 |
4 | PST | 6.2 |
5 | PST | 26.1 |
6 | PST | 26.4 |
7 | PST | 40.8 |
8 | PST | 22.1 |
9 | PST | 26.9 |
10 | PST | 13.9 |
11 | PST | 31.9 |
12 | PST | 49.2 |
13 | PST | 17.1 |
14 | PST | 24.1 |
15 | PST | 36.8 |
16 | PST | 34.0 |
17 | PST | 12.1 |
18 | PST | 20.9 |
19 | PST | 18.2 |
20 | PST | 37.4 |
1 | EDTA | 19.9 |
2 | EDTA | 20.0 |
3 | EDTA | 32.9 |
4 | EDTA | 6.4 |
5 | EDTA | 27.0 |
6 | EDTA | 29.0 |
7 | EDTA | 48.1 |
8 | EDTA | 24.3 |
9 | EDTA | 36.0 |
10 | EDTA | 13.7 |
11 | EDTA | 36.9 |
12 | EDTA | 57.4 |
13 | EDTA | 15.7 |
14 | EDTA | 26.3 |
15 | EDTA | 42.6 |
16 | EDTA | 44.2 |
17 | EDTA | 14.1 |
18 | EDTA | 25.4 |
19 | EDTA | 23.0 |
20 | EDTA | 34.1 |
Now we see that there is a column for tube-type and a column for the PTH results which we can name accordingly. You can see why this called the “datalong” format.
names(tube.data.2) <- c("Subject", "Tube.Type", "PTH") tube.data.2$Tube.Type <- as.factor(tube.data.2$Tube.Type) #turns tube type into factor
Visualize
Summarize the data:
summary(tube.data)
## Subject SST PST EDTA ## 1 : 1 Min. : 5.70 Min. : 6.20 Min. : 6.40 ## 2 : 1 1st Qu.:17.18 1st Qu.:17.85 1st Qu.:19.98 ## 3 : 1 Median :23.35 Median :25.10 Median :26.65 ## 4 : 1 Mean :24.75 Mean :25.36 Mean :28.85 ## 5 : 1 3rd Qu.:32.90 3rd Qu.:32.42 3rd Qu.:36.23 ## 6 : 1 Max. :44.50 Max. :49.20 Max. :57.40 ## (Other):14
Let's just have a quick look graphically:
library(mcr) plot(mcreg(SST, EDTA, method.reg = "PaBa", mref.name = "SST", mtest.name = "EDTA"))
plot(mcreg(SST, PST, method.reg = "PaBa", mref.name = "SST", mtest.name = "PST"))
And as a boxplot with the points overtop:
boxplot(PTH ~ Tube.Type, data = tube.data.2, col = c("purple", "lightgreen", "gold")) stripchart(PTH ~ Tube.Type, vertical = TRUE, data = tube.data.2, method = "jitter", add = TRUE, pch = 20, col = rgb(0,0,0,0.5))
Separate the Wheat from the Chaff
Now we want to make comparisons to see if these are different. To accomplish this, we will use the aov()
function. This requires us to have data formatted “datalong” as it is in the tube.data.2
dataframe.
fit <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data=tube.data.2)
If you are like me, this syntax is confusing. But it goes like this. PTH
is a function of Tube.Type
which is straight forward–hence the PTH ~ Tube.Type
bit. The error term has the Subject
in front of the /
and the factor that is different within the subjects (Tube.Type
) after the /
. That's my grade 2 explanation from reading this and this and this.
summary(fit)
## ## Error: Subject ## Df Sum Sq Mean Sq F value Pr(>F) ## Residuals 19 7307 384.6 ## ## Error: Subject:Tube.Type ## Df Sum Sq Mean Sq F value Pr(>F) ## Tube.Type 2 195.9 97.97 22.47 3.63e-07 *** ## Residuals 38 165.7 4.36 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This tells us that there is a difference between the groups but it does not specify where the difference is.
I can't see the difference. Can you see the difference?
Sorry – I just had to make a pop-culture reference to this. We want to be specific about where the differences are without making a Type I error which might arise if we blindly charge ahead and do multiple paired t-tests. One easy way to accomplish this is to use the pairwise.t.test()
function which does corrections for multiple comparisons. You can choose from a number of approaches for adjustment for pairwise comparison. This requires the “response vector” which is PTH and the “grouping factor” which is the tube type.
# choices for p.adjust.method are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none") pwt <- pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type, p.adj = "bonferroni", paired = TRUE) pwt
## ## Pairwise comparisons using paired t tests ## ## data: tube.data.2$PTH and tube.data.2$Tube.Type ## ## EDTA PST ## PST 0.00083 - ## SST 7.9e-05 0.35033 ## ## P value adjustment method: bonferroni
This is pretty easy to understand. There are statistically significant differences found between the EDTA and PST (p = 0.00083) and the EDTA and PST (p = 0.00008) but none between SST and PST (p = 0.35033).
Conclusion
Non-statistician's approach to tube-type comparisons, which is also applicable to analyte stability studies. This is a one-way repeated measures ANOVA with one within-subjects factor. There is a great deal more to say on the matter by people who know much more in the citations in the links provided above.
God probably uses datawide format
All the nations will be gathered before him, and he will separate the people one from another as a shepherd separates the sheep from the goats. He will put the sheep on his right and the goats on his left.
(Matt 25:32-33)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.