Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
By Chris Campbell
As another new baby card is passed around the office, and the latest cute baby pictures are emailed out, a discussion is underway. Could it be true? Something in the water? An elixir of fertility that should be bottled and sold to desperate couples for enormous profit? Is Mango having some crazy baby boom?!
Mango has been having something of a population explosion. In the first seven years of Mango there were only a couple of babies. Since moving to Methuen Park it seems like there’s been one every few months. So is there really something magical happening here?
Of course not. Now demonstrating this kind of data is an absolute HR nightmare, so please forgive me if I’m somewhat coy with revealing the raw datasets. However, I’d like to share a small analysis I did as an example of how to set up a dataset for survival analysis with multiple events. To investigate the hypothesis that there’d been a change in the rate of baby production, I had two datasets. One, babies
, has the Date of birth of the child, as well as some covariates. Splitting births into events in the first 3432 days (9.4 years) and most recent 1268 days (3.5 years) shows that 40% the number of babies were born before moving to our current home, Methuen Park, compared to when at the new offices. In fact, the rate jumps from 0.4 babies/year to 2.9 babies/year!
names(babies) # [1] "Name" "Date" "Sex" "Parent" library(dplyr) day1 <- as.Date("2002-10-01") summarize( .data = group_by(x = babies, Methuen = Date - day1 > 3432), Number = length(Date)) # Source: local data frame [2 x 2] # # Methuen Number # (lgl) (int) # 1 FALSE 4 # 2 TRUE 10
I helped Francis Smart develop his stick man script into a small package, stick, that you can install from GitHub. This allows me to demonstrate graphically how absolutely remarkable this step change in birth rate truly is.
There is a plotStick
function in the package, but since I want to demonstrate which time period is at Methuen without obscuring the sticks, I added them to a blank plot with pointsStick
. Each stick represents one additional birth. The sticks wear dresses if they are female, and are coloured to show unique parents. The mood of the face is random, to represent the changeable mood of a newborn. Sleeping stick is not yet implemented.
< details> < summary> R Code for Stick Plot
library(devtools) # install stick package install_github("EconometricsBySimulation/R-Graphics/Stick-Figures/stick") library(stick) # blank plot plot(x = babies$Date, y = seq_len(nrow(babies)), main = "Cumulative Baby Numbers for Mango Solutions Employees", xlab = "Date", ylab = "Total Babies", type = "n", ylim = c(0, 15)) # highlight time at Methuen Park methuen <- rainbow(1, s = 0.5, start = 0.2) polygon(x = day1 + c(3432, 4700, 4700, 3432), y = c(-2, -2, 17, 17), border = methuen, col = methuen) box() # add sticks pointsStick(x = babies$Date, y = seq_len(nrow(babies)), # female babies wear a dress I guess gender = babies$Sex, # colour by parent col = rainbow(length(unique(babies$Parent)), s = 0.9)[factor(babies$Parent)], # baby mood is unpredictable at best face = sample(x = c("happy", "neutral", "surprised", "annoyed", "sad"), size = nrow(babies), replace = TRUE, prob = (5:1) / 15), # babies rarely, if ever keep hats on hat = FALSE) # add labels text(x = as.Date(c("2006-03-30", "2012-05-29")), y = c(14, 14), labels = c("Before Methuen", "At Methuen"), pos = 4)
Plotting the number of babies born, it certainly looks like there’s a correlation between Methuen Park and babies. However, examining employee head count during the two periods trivially shows that baby birth number correlates with number of employees. Baby births and employee numbers are 40% the level at Methuen Park.
names(employeeno) # [1] "Date" "Employees" summarize( .data = group_by(x = employeeno, Methuen = Date - day1 > 3432), Number = median(Employees)) # Source: local data frame [2 x 2] # # Methuen Number # (lgl) (int) # 1 FALSE 22 # 2 TRUE 52
However, for the sake of fun, we can see whether there is a difference in birth rates between these two periods. One way of measuring the occurrence of events is survival analysis. A lot of tools for survival analysis are available in the survival package. Following the common modelling idiom in R, the model is defined as a formula. Events go on the left hand side of the formula, and are coded as a Surv object (a matrix of time and event status). For data where only one event can be observed for a subject, only one event time is observed. But people who have had a baby are still at risk of having another baby!
For multiple events in the Surv object, the data need to be shaped so that the time at risk of each individual is marked as start time, end time and status: event observed or subject censored (no event observed). Subjects can occur at multiple records.
library(data.table) tab2 <- data.table(tab) setkey(tab2, Employee) tab2[c("Andy Nicholls", "Chris Campbell", "Chris Musselle"), ] # Employee Start Stop Status #1: Andy Nicholls 3031 4408 1 #2: Andy Nicholls 4408 4595 0 #3: Chris Campbell 3361 3893 1 #4: Chris Campbell 3893 4479 1 #5: Chris Campbell 4479 4595 0 #6: Chris Musselle 4124 4595 0
In this experiment, subjects are at risk during two time periods. To divide this dataset into two groups based on site change at 3432 days since the start of Mango, we can use the foverlaps
function from package data.table.
< details> < summary> R Code for Splitting into Time Intervals
# split date dmethuen <- data.table(Start = 3432, Stop = 3432, Status = 0, key = c("Start", "Stop")) setkey(tab2, Start, Stop) # records where split occurs flagged in Start/Stop columns tab2 <- foverlaps(x = tab2, y = dmethuen, type = "any") setkey(tab2, Employee) tab2[c("Andy Nicholls", "Chris Campbell", "Chris Musselle"), ] # Start Stop Status Employee i.Start i.Stop i.Status # 1: 3432 3432 0 Andy Nicholls 3031 4408 1 # 2: NA NA NA Andy Nicholls 4408 4595 0 # 3: 3432 3432 0 Chris Campbell 3361 3893 1 # 4: NA NA NA Chris Campbell 3893 4479 1 # 5: NA NA NA Chris Campbell 4479 4595 0 # 6: NA NA NA Chris Musselle 4124 4595 0 # bind new columns as new rows, using new Status tab2 <- rbindlist( list( tab2[!is.na(Start), list(Employee, Start = i.Start, Stop, Status = Status)], tab2[!is.na(Start), list(Employee, Start = Start, Stop = i.Stop, Status = i.Status)], tab2[is.na(Start), list(Employee, Start = i.Start, Stop = i.Stop, Status = i.Status)])) # add flag column for modelling tab2[, Location := factor(x = Start >= 3432, levels = c(FALSE, TRUE), labels = c("Greenways", "Methuen"))] setkey(tab2, Employee) tab2[c("Andy Nicholls", "Chris Campbell", "Chris Musselle"), ] # Employee Start Stop Status Location #1: Andy Nicholls 3031 3432 0 Greenways #2: Andy Nicholls 3432 4408 1 Methuen #3: Andy Nicholls 4408 4595 0 Methuen #4: Chris Campbell 3361 3432 0 Greenways #5: Chris Campbell 3432 3893 1 Methuen #6: Chris Campbell 3893 4479 1 Methuen #7: Chris Campbell 4479 4595 0 Methuen #8: Chris Musselle 4124 4595 0 Methuen
This allowed me to use the Methuen or not-Methuen as a possible covariate. I created a Cox Proportional Hazards fit of event status with Location as an explanatory variable. There was no improvement in the likelihood of the model by adding a Location effect. The rates of baby creation for a person at risk is constant, whether at Methuen Park or the old offices. As we suspected, the only difference, which we can reasonably infer to be causative, is the number of person-days at risk in the two groups.
library(survival) newBabyLocs <- coxph( formula = Surv(Start, Stop, Status) ~ Location, data = tab) anova(newBabyLocs) # Analysis of Deviance Table # Cox model: response is Surv(Start, Stop, Status) # Terms added sequentially (first to last) # # loglik Chisq Df Pr(>|Chi|) # NULL -48.341 # Location -48.341 0 1 1
This approach suggests that Location doesn’t influence risk of babies. Absence of evidence does not constitute evidence of absence, so we can’t completely rule out an effect. However, supporting evidence strongly suggests that there is a causal relationship between moving to larger offices at Methuen Park and Mango population size. The increased number of babies born simply reflects the increased number of subjects at risk. So if there is something in the water, the effect size must be very small.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.