Visualizing MonteCarlo Simulation Results: Mean vs Median

[This article was first published on R – first differences, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Simulation studies are used in a wide range of areas from risk management, to epidemiology, and of course in statistics. The MonteCarlo package provides tools to automatize the design of these kind of simulation studies in R. The user only has to specify the random experiment he or she wants to conduct and to specify the number of replications. The rest is handled by the package.

So far, the main tool to analyze the results was to look at Latex tables generated using the MakeTable() function. Now, the new package version 1.0.4 contains the function MakeFrame() that allows to represent the simulation results in form of a dataframe. This makes it very easy to visualize the results using standard tools such as dplyr and ggplot2.

Here, I will demonstrate some of these concepts for a simple example that could be part of an introductory statistics course: the comparison of the mean and the median as estimators for the expected value. For an introduction to the MonteCarlo package click here or confer the package vignette.

For a symmetric distribution both the mean and the median are consistent estimators of the expected value, but since the mean is the maximum likelihood estimator, it is more efficient. On the other hand, it is usually stressed that in contrast to the mean, the median is not sensitive to outliers.

To demonstrate this, and to explore the relative magnitude of these effects depending on the sample size, we formulate a suitable random experiment.

library(MonteCarlo)
set.seed(123)

mean_vs_median<-function(n, out){

  # generate sample
  sample<-rnorm(n, 0, 1)

  # add outlier
  sample[1]<-sample[1]+out

  # calculate estimators
  mean_sample<-mean(sample)
  median_sample<-median(sample)

  # return results
  return(list("mean"=mean_sample, "median"=median_sample))
}

The mean_vs_median() function generates a sample of n standard normal distributed random variables and contaminates the first observation with a deterministic outlier of size out. We then calculate both the mean and the median and return them as elements of a list. Note that the function has two arguments – the sample size n and the size of the outlier out.

To analyze the effect of these two parameters on the relative performance of the two estimators, we use the MonteCarlo package. All we have to do is to pass the mean_vs_median function to the MonteCarlo() function and to specify the parameter values we are interested in. In the example below, we look at outliers of size 0, 3, or 5 in a sample of size 5, 25, or 250 and the experiment is repeated 1000 times.

erg_mean_median<-MonteCarlo(func=mean_vs_median, nrep=1000, param_list=list("n"=c(5, 25, 250), "out"=c(0, 3, 5)))

summary(erg_mean_median)

## Simulation of function: 
## 
## function(n, out){
## 
##   # generate sample
##   sample<-rnorm(n, 0, 1)
##   
##   # add outlier
##   sample[1]<-sample[1]+out
## 
##   # calculate estimators
##   mean_sample<-mean(sample)
##   median_sample<-median(sample)
## 
##   # return results
##   return(list("mean"=mean_sample, "median"=median_sample))
## }
## 
## Required time: 2.61 secs for nrep = 1000  repetitions on 1 CPUs 
## 
## Parameter grid: 
## 
##    n : 5 25 250 
##  out : 0 3 10 
## 
##  
## 2 output arrays of dimensions: 3 3 1000

This is all the programming that is required to run the simulation study. To be able to generate plots from the results we call the MakeFrame() function on the object returned by the simulation.

df<-MakeFrame(erg_mean_median)
head(df)

##     n out        mean      median
## 1  25   0  0.85512352  1.28055488
## 2 250   0 -0.19633118 -0.09100042
## 3   5   0 -0.07376411 -0.06749168
## 4  25   3  0.59734761  0.59949205
## 5 250   3  0.13122706  0.28618942
## 6   5   3  0.07257960  0.12357383

As one can see, the result is a dataframe. Each row contains information on a single repetition of the random experiment. The first two columns contain the values of the parameters and the other two columns contain the estimates of the mean and the median.

To manipulate the dataframe and to make plots, we load the tidyverse package and convert the dataframe to a tibble.

library(tidyverse)
tbl <- tbl_df(df)

To compare the efficiency of the estimates in absence of an outlier, we focus on the cases where out=0 and then compare estimates of the distribution of the estimators for different sample sizes. For each sample size we use a different colour and the mean and the median can be distinguished by the linetypes.

ggplot(filter(tbl, out==0)) + 
  geom_density(aes(x=mean, col=factor(n), linetype="mean")) +
  geom_density(aes(x=median, col=factor(n), linetype="median"))

unnamed-chunk-7-1

It is clear to see that both the distribution of the mean and the median are centered around the true expected value of zero. This implies that both estimators are unbiased. However, the distribution of the mean tends to be more concentrated than that of the median. The mean has a smaller variance and is therefore more efficient.

This can also be seen if we calculate the corresponding summary statistics. Using the tibble and the dplyr package, this can be done with a single line of code.

tbl %>% filter(out==0) %>% group_by(n) %>% summarise_each("sd") %>% round(digits=2)

## # A tibble: 3 x 4
##       n   out  mean median
##       
## 1     5     0  0.06   0.08
## 2    25     0  0.44   0.52
## 3   250     0  0.19   0.24

We now consider the effect of outliers on the two estimators. To do so, we generate a similar plot, but now we keep the sample size constant a n=25 and different colors represent outliers of different magnitudes.

ggplot(filter(tbl, n==25)) + 
  geom_density(aes(x=mean, col=factor(out), linetype="mean")) +
  geom_density(aes(x=median, col=factor(out), linetype="median"))

unnamed-chunk-9-1

It is clear to see that the dashed lines representing the distribution of the medians are centered around the true mean of zero, irrespective of the size of the outlier. The distribution of the means, on the other hand, shifts further to the right the larger the magnitude of the outlier. This shows that the median is robust to outliers, whereas the mean is not.

Finally, we want to explore the interaction of the effect of the size of the outlier with the sample size. We therefore focus on the mean. Different colors now represent different sizes of the outlier and different linetypes represent different sample sizes.

ggplot(filter(tbl)) + 
  geom_density(aes(x=mean, col=factor(out), linetype=factor(n)))

unnamed-chunk-10-1

For an outlier of a given size, we can observe that its impact decreases as the sample size increases. While the effect of the outlier can be quite dramatic for n=5, the effect basically vanishes for n=250. The sensitivity of the mean to outliers is therefore a finite sample property that is less important in larger samples.

As can be seen from this example, conducting simulation studies requires minimal effort, if a package such as MonteCarlo or one of its competitors such as simsalapar or SimDesign is used. The programming required to produce this analysis should be simple enough so that simulations are not restricted to be a tool for research, but can even be used for teaching at an undergraduate level.

To leave a comment for the author, please follow the link and comment on their blog: R – first differences.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.