Central Limit Theorem
The Central Limit Theorem (CLT) is an important theorem in statistics. It says that the mean of a sufficiently large random sample is approximately normally distributed, regardless of how the population itself is distributed (normal, uniform, exponential, etc.), provided the population has a finite variance. In practice this means that statistical tools and methods that assume a normal distribution can be applied to sample means even when the underlying population is not normal.
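For reference, the classical form of the theorem for independent, identically distributed draws can be written as follows. The symbols are notation introduced here rather than taken from the post: X_1, ..., X_k are the draws in one sample, k is the sample size, and mu and sigma are the population mean and standard deviation (sigma must be finite).

```latex
\bar{X}_k = \frac{1}{k}\sum_{i=1}^{k} X_i ,
\qquad
\sqrt{k}\,\frac{\bar{X}_k - \mu}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } k \to \infty
```

In words: the sample mean is centred on mu and its spread shrinks like sigma divided by the square root of k, while its shape approaches a normal curve.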
For a mathematical proof of the theorem you can, for example, look at the Wikipedia page. Here I just show a demonstration in R for four distribution types: normal (Gaussian), uniform, exponential and lognormal. For each distribution a matrix with n data points is created. From each row of this matrix the means of samples of size 2, 4 and 25 are calculated. These are plotted later. The parameter n is crucial here: it defines the size of the simulated population. The bigger it is, the more closely the histogram reflects the theoretical shape of the distribution (the top row in the image below). A larger n does require more computing time and memory. On my laptop with 8 GB of RAM I run into errors with n = 1·10^9.
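A rough back-of-the-envelope estimate suggests why a population of about a billion values is too much for 8 GB of RAM. This sketch only counts the raw numeric vector and ignores the extra copies made by matrix() and data.frame(), so the real requirement is likely higher; the variable names are illustrative.

```r
# Rough memory estimate for one numeric vector of length n.
n <- 1e9
bytes_per_double <- 8                  # R stores numeric values as 8-byte doubles
gib <- n * bytes_per_double / 2^30     # convert bytes to GiB
print(round(gib, 2))                   # about 7.45 GiB for the raw values alone
```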
As you can see from the plot, the distribution of the sample means is already close to a normal distribution at a sample size of 25.
# a: normal distribution
# b: uniform distribution
# c: exponential distribution
# d: lognormal distribution
n <- 1E6

# Calculations
# Each matrix has 25 columns, so every row is one sample of 25 draws.
# n2, n4 and n25 hold the row means of the first 2, 4 and 25 columns.
a <- data.frame(matrix(rnorm(n, mean=10, sd=1), ncol=25))
a$n2 <- rowMeans(a[1:2])
a$n4 <- rowMeans(a[1:4])
a$n25 <- rowMeans(a[1:25])

b <- data.frame(matrix(runif(n, min=1, max=10), ncol=25))
b$n2 <- rowMeans(b[1:2])
b$n4 <- rowMeans(b[1:4])
b$n25 <- rowMeans(b[1:25])

c <- data.frame(matrix(rexp(n, rate=1), ncol=25))
c$n2 <- rowMeans(c[1:2])
c$n4 <- rowMeans(c[1:4])
c$n25 <- rowMeans(c[1:25])

d <- data.frame(matrix(rlnorm(n, meanlog=10, sdlog=1), ncol=25))
d$n2 <- rowMeans(d[1:2])
d$n4 <- rowMeans(d[1:4])
d$n25 <- rowMeans(d[1:25])
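The post does not include its plotting code. A minimal sketch of how such a grid of histograms could be drawn with base R graphics is given here. The smaller n, the fixed seed, the `generators` list and the panel layout (one column per distribution, one row per sample size) are assumptions made for illustration, not the author's original script.

```r
set.seed(42)                       # assumption: fixed seed for reproducibility
n <- 1e5                           # smaller than the post's 1E6 to keep it quick

# One generator per distribution, with the same parameters as in the post.
generators <- list(
  Normal      = function(n) rnorm(n, mean = 10, sd = 1),
  Uniform     = function(n) runif(n, min = 1, max = 10),
  Exponential = function(n) rexp(n, rate = 1),
  Lognormal   = function(n) rlnorm(n, meanlog = 10, sdlog = 1)
)
sizes <- c(1, 2, 4, 25)            # 1 = raw values, then means of 2, 4 and 25

# mfcol fills the grid column by column: each distribution gets one column,
# with the sample sizes running from top to bottom.
op <- par(mfcol = c(length(sizes), length(generators)), mar = c(2, 2, 2, 1))
for (name in names(generators)) {
  m <- matrix(generators[[name]](n), ncol = 25)   # each row is one sample of 25
  for (k in sizes) {
    means <- rowMeans(m[, 1:k, drop = FALSE])     # mean of the first k columns
    hist(means, breaks = 50, main = paste0(name, ", k = ", k),
         xlab = "", ylab = "")
  }
}
par(op)                            # restore the previous graphics settings
```

Run interactively this opens one plotting window with 16 panels; run through Rscript, the histograms go to the default PDF device instead.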
Plotting these will give the following graphs. From top to bottom: the population drawn from each distribution, followed by the means of samples of size 2, 4 and 25. With the latter, the distribution of the sample means is approximately normal (Gaussian), meaning that statistical tools based on this distribution can be applied to the means (for example, confidence intervals built from the standard error of the mean).
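The visual impression can be backed up with a small numeric check. The sketch below uses the exponential distribution with rate 1, whose standard deviation is 1 and whose skewness is 2. For means of k independent draws, theory gives a standard deviation of 1/sqrt(k) and a skewness of 2/sqrt(k). The `skewness` helper, the seed and the `check` table are illustrative names, not part of the original post.

```r
set.seed(1)                                    # assumption: fixed seed
k_values <- c(1, 2, 4, 25)
m <- matrix(rexp(1e6, rate = 1), ncol = 25)    # 40,000 rows of 25 draws each

# Sample skewness: third central moment divided by the variance to the power 1.5.
# It is 0 for a perfectly symmetric distribution such as the normal.
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Row means of the first k columns, for every k.
means_by_k <- lapply(k_values, function(k) rowMeans(m[, 1:k, drop = FALSE]))

check <- data.frame(
  k             = k_values,
  sd_observed   = sapply(means_by_k, sd),
  sd_expected   = 1 / sqrt(k_values),
  skew_observed = sapply(means_by_k, skewness),
  skew_expected = 2 / sqrt(k_values)
)
print(round(check, 3))
```

The expected skewness at k = 25 is 0.4 rather than 0, so the means are close to normal but not exactly normal. Strongly skewed populations such as the lognormal need larger samples to get equally close.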
References
- Kwong, C.W., 2009, The Use of R Language in the Teaching of Central Limit Theorem, National Institute of Education, Nanyang Technological University, Singapore, Asian Technology Conference on Mathematics.
- Central Limit Theorem on Wikipedia