Assumption Checking – Part I
Often when working, we are under deadlines to produce results in a reasonable timeframe. An analyst under a tight deadline may skip checking assumptions. A simple example is the one-sample t-test: you might need to test whether the mean of your sample differs from a specific number. Two assumptions of the t-test that are often overlooked are that the sample is drawn randomly from the population and that the population follows a Gaussian distribution. When was the last time in the workplace that you heard of someone performing a normality test before running a t-test? It is considered an extra step that is not usually taken. It should not be considered a burden, though, and it can easily be accomplished with a wrapper function in R. The function below runs a Shapiro-Wilk test first; if its p-value is above 0.10 it runs a t-test, otherwise it falls back to a Wilcoxon signed-rank test.
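mytest <- function(x, value = 0) {
  xx <- as.character(substitute(x))
  if (!is.numeric(x)) stop(sprintf('%s is not numeric', xx))
  # Shapiro-Wilk p-value above 0.10: no evidence against normality, use the t-test
  if (shapiro.test(x)$p.value > .10) {
    print(t.test(x, mu = value))
  } else {
    # Otherwise fall back to the Wilcoxon signed-rank test
    print(wilcox.test(x, mu = value))
  }
}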
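Because mytest() only prints, the chosen test cannot be reused later in a script. A minimal sketch of a variant that returns the result is shown here. The name mytest_result and the alpha argument are hypothetical additions, not part of the original function. The sketch also guards against sample sizes that shapiro.test() rejects, since that function only accepts between 3 and 5000 non-missing values.

```r
# Hypothetical variant of mytest(): same decision rule, but the test object
# is returned (invisibly) instead of only being printed.
mytest_result <- function(x, value = 0, alpha = 0.10) {
  label <- deparse(substitute(x))
  if (!is.numeric(x)) stop(sprintf("%s is not numeric", label))

  # Drop missing values before counting observations
  x <- x[!is.na(x)]

  # shapiro.test() requires between 3 and 5000 non-missing values
  if (length(x) < 3 || length(x) > 5000) {
    stop(sprintf("%s needs between 3 and 5000 non-missing values", label))
  }

  # Same rule as mytest(): t-test if normality is not rejected at alpha
  normal_ok <- shapiro.test(x)$p.value > alpha
  result <- if (normal_ok) {
    t.test(x, mu = value)
  } else {
    wilcox.test(x, mu = value)
  }
  invisible(result)
}

set.seed(123)
res <- mytest_result(rnorm(50), value = 0)
res$method   # name of the test that was chosen
res$p.value  # p-value, available for further use
```

Returning the object invisibly keeps the console quiet while still letting a caller store the result, print it, or pull out individual fields.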
We can combine that with another function to produce a density plot.
myplot <- function(x, color = "blue") {
  xx <- as.character(substitute(x))
  if (!is.numeric(x)) stop(sprintf('%s is not numeric', xx))
  # "\n" puts the dataset name on a second title line
  title <- paste("Density Plot", "\n", "Dataset = ", deparse(substitute(x)))
  mydens <- density(x)
  plot(mydens, main = title, las = 1)
  # Fill the area under the density curve
  polygon(mydens, col = color)
}
Now, let's see how our functions work. If we generate random values from a Gaussian distribution, we would expect them to "normally" pass the normality test, so a t-test is performed. However, if the data were generated from a distribution that is not normal, then we would typically expect to see the results of the Wilcoxon test instead.
set.seed(123)
n <- 1000
normal <- rnorm(n, 0, 1)
chisq <- rchisq(n, df = 5)

mytest(normal)
myplot(normal)

# Test for difference from 5 for chi-square data
mytest(chisq, value = 5)
myplot(chisq, color = "orange")
Results from 'mytest(normal)':
One Sample t-test
data: x
t = 0.5143, df = 999, p-value = 0.6072
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.04541145 0.07766719
sample estimates:
mean of x
0.01612787
Results from 'mytest(chisq,value=5)':
Wilcoxon signed rank test with continuity correction
data: x
V = 214385, p-value = 8.644e-05
alternative hypothesis: true location is not equal to 5
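To see why each call took the branch it did, the Shapiro-Wilk p-values can be inspected directly. The sketch below regenerates the same data with the same seed. The names p_normal and p_chisq are introduced here for illustration only. It also computes the sample mean and median of the chi-square data, which helps when interpreting the Wilcoxon result.

```r
# Regenerate the data used above (same seed, same order of calls)
set.seed(123)
n <- 1000
normal <- rnorm(n, 0, 1)
chisq <- rchisq(n, df = 5)

# p-values that drive the branch inside mytest()
p_normal <- shapiro.test(normal)$p.value
p_chisq <- shapiro.test(chisq)$p.value

p_normal > 0.10  # TRUE means mytest() runs the t-test
p_chisq > 0.10   # FALSE means mytest() falls back to the Wilcoxon test

# Location summaries for the skewed sample
mean(chisq)
median(chisq)
```

One caveat when reading the second result: the Wilcoxon signed-rank test is a test of location for a symmetric distribution rather than a test of the mean. A chi-square distribution with 5 degrees of freedom has a mean of exactly 5 but a median near 4.35, so the small p-value most likely reflects the skew of the data and does not show that the mean differs from 5.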
Conclusion
The benefit of working ahead is clear: once the assumption check lives inside a function, it no longer costs an extra step. Once you have these functions written, you can add them to your personal R package hosted on GitHub. Then you will be able to use them wherever you have an internet connection, and the whole R community has the chance to benefit. It is also easy to combine the two functions into one.
# Combine the functions
PlotAndTest <- function(x) {
  mytest(x)
  myplot(x)
}
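As written, PlotAndTest() always tests against 0, and its plot title reads "Dataset = x" because myplot() only sees the argument name x. A possible refinement is sketched below. It is self-contained rather than built on mytest() and myplot(), and the name plot_and_test and the alpha argument are assumptions made for this example. It captures the dataset name once, forwards value and color, and returns the test result.

```r
# Hypothetical combined function: test and plot in one call.
plot_and_test <- function(x, value = 0, color = "blue", alpha = 0.10) {
  # Capture the caller's expression before x is used anywhere else
  label <- paste(deparse(substitute(x)), collapse = " ")
  if (!is.numeric(x)) stop(sprintf("%s is not numeric", label))

  # Same decision rule as mytest(): t-test unless normality is rejected
  result <- if (shapiro.test(x)$p.value > alpha) {
    t.test(x, mu = value)
  } else {
    wilcox.test(x, mu = value)
  }
  print(result)

  # Same density plot as myplot(), titled with the caller's dataset name
  dens <- density(x)
  plot(dens, main = paste0("Density Plot\nDataset = ", label), las = 1)
  polygon(dens, col = color)

  invisible(result)
}
```

With the data from earlier, a call such as plot_and_test(chisq, value = 5, color = "orange") would print the test and draw the orange density plot titled with "chisq".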