This is a little digression from Chapter 5 of Using R for Introductory Statistics that led me to the hypergeometric distribution.
Question 5.13 A sample of 100 people is drawn from a population of 600,000. If it is known that 40% of the population has a specific attribute, what is the probability that 35 or fewer in the sample have that attribute?
I’m pretty sure that you’re supposed to reason that 600,000 is sufficiently large that the draws from the population are close enough to independent. The answer is then computed like so:
> pbinom(35,100,0.4)
[1] 0.1794694
Although this is close enough for practical purposes, the real way to answer this question is with the hypergeometric distribution.
The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of k draws from a finite population without replacement, just as the binomial distribution describes the number of successes for draws with replacement.
The situation is usually described in terms of balls and urns. An urn holds N = m + n balls: m white and n black. We draw k balls without replacement, and X is the number of white balls drawn.
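To make the connection to dhyper concrete, here is a minimal sketch of the probability mass function using a small, made-up urn (the numbers below are purely illustrative): the number of ways to choose x white and k - x black balls, divided by the number of ways to choose any k balls.

# Toy urn, illustrative numbers only: 7 white and 5 black balls, draw 4,
# probability of getting exactly 2 white
m <- 7; n <- 5; k <- 4; x <- 2
choose(m, x) * choose(n, k - x) / choose(m + n, k)  # 210/495 = 0.4242424
dhyper(x, m, n, k)                                  # same value from R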
R gives us the function phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE), which does indeed show that our approximation was close enough.
> phyper(35,240000,360000, 100)
[1] 0.1794489
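As a quick sanity check, phyper is just the cumulative version of dhyper, so summing the point probabilities for 0 through 35 successes should give the same lower-tail probability:

# Lower tail as a sum of point probabilities
sum(dhyper(0:35, 240000, 360000, 100))  # should match phyper(35, 240000, 360000, 100)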
Since we’re down with OCD, let’s explore a bit further. Our population is defined and not too huge, so let’s just try it empirically. First, create the population.
> pop <- rep(c(0,1),c(360000, 240000))
> length(pop)
[1] 600000
> mean(pop)
[1] 0.4
> sd(pop)
[1] 0.4898984
Next, generate a boatload of samples and see how many of them have 35 or fewer of the special members.
> sums <- sapply(1:200000, function(x) { sum(sample(pop,100))})
> sum(sums <= 35) / 200000
[1] 0.17935
Pretty close to our computed results. I thought I might be able to compute an answer using the central limit theorem: the distribution of sample means should be approximately normal.
> means <- sapply(1:2000, function(x) { mean(sample(pop,100))})
> mean(means)
[1] 0.40154
> sd(means)
[1] 0.0479998
> curve(dnorm(x, 0.4, sd(pop)/sqrt(100)), 0.2, 0.6, col='blue')
> lines(density(means), lty=2)
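As an aside (my own sketch, not something from the book): the reason the with-replacement standard error sd(pop)/sqrt(100) works so well here is that the exact without-replacement standard error only differs by the finite population correction, which is essentially 1 when drawing 100 from 600,000.

# Standard error of the sample proportion, with and without replacement
N <- 600000; k <- 100; p <- 0.4
sqrt(p * (1 - p) / k)                            # with replacement: ~0.049
sqrt(p * (1 - p) / k) * sqrt((N - k) / (N - 1))  # finite population correction barely changes it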
Shouldn’t I be able to compute how many of my samples will have 35 or fewer special members? This seems to be a ways off, but I don’t know why. Maybe it’s just the error due to approximating a discrete distribution with a continuous one?
> pnorm(0.35, 0.4, sd(pop)/sqrt(100))
[1] 0.1537173
This fudge gets us closer, but still not as close as our initial approximation.
> pnorm(0.355, 0.4, sd(pop)/sqrt(100))
[1] 0.1791634
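For what it’s worth, that 0.355 looks like the standard continuity correction: the number of special members is an integer, so “35 or fewer” is approximated by the normal area up to 35.5, which is 0.355 on the proportion scale. A minimal sketch of the same calculation on the count scale:

# Continuity-corrected normal approximation on the count scale:
# mean = 100 * 0.4 = 40, sd = sqrt(100 * 0.4 * 0.6), evaluated at 35 + 0.5
pnorm(35.5, mean = 100 * 0.4, sd = sqrt(100 * 0.4 * 0.6))
# essentially the same as pnorm(0.355, 0.4, sd(pop)/sqrt(100)) above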
If anyone knows what’s up with this, that’s what comments are for. Help me out.
Notes on Using R for Introductory Statistics
Chapters 1 and 2
Chapter 3
- Categorical data
- Comparing independent samples
- Relationships in numeric data, correlation
- Simple linear regression
Chapter 4
Chapter 5