Python and R: Basic Sampling Problem
[This article was first published on Analysis with Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.
Problems
- Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.
$y_1 = 98$ $y_2 = 102$ $y_3=154$ $y_4 = 133$ $y_5 = 190$ $y_6=175$
We are interested in $bar{y}_U$, the population mean. Consider eight possible samples chosen.Sample No. Sample, $mathcal{S}$ $P(mathcal{S})$ 1 ${1,3,5}$ $1/8$ 2 ${1,3,6}$ $1/8$ 3 ${1,4,5}$ $1/8$ 4 ${1,4,6}$ $1/8$ 5 ${2,3,5}$ $1/8$ 6 ${2,3,6}$ $1/8$ 7 ${2,4,5}$ $1/8$ 8 ${2,4,6}$ $1/8$ - What is the value of $bar{y}_U$?
- Let $bar{y}$ be the mean of the sample values. For each sampling plan, find
- $mathrm{E}bar{y}$;
- $mathrm{Var}bar{y}$;
- $mathrm{Bias}(bar{y})$;
- $mathrm{MSE}(bar{y})$;
- Mayr et al. (1994) took an SRS of 240 children who visisted their pediatric outpatient clinic. They found the following frequency distribution for the age (in months) of free (unassisted) walking among the children:
Age (months) 9 10 11 12 13 14 15 16 17 18 19 20 Number of Children 13 35 44 69 36 24 7 3 2 5 1 1
Find the mean and SE of the age for onset of free walking. - Table 1 gives the cultivated area in acres in 1981 for 40 villages in a region. (Theory and Method of Survey) Using the arrangement (random) of data in the table, draw systematic sample of size 8. Use r ((random start) = 2,
Village $Y_j$ Village $Y_j$ Village $Y_j$ Village $Y_j$ 1 105 11 319 21 70 31 16 2 625 12 72 22 249 32 439 3 47 13 109 23 384 33 123 4 312 14 91 24 482 34 207 5 327 15 152 25 378 35 145 6 230 16 189 26 111 36 666 7 240 17 365 27 534 37 338 8 203 18 70 28 306 38 624 9 535 9 249 29 655 39 501 10 275 20 384 30 102 40 962
Solutions
In order to appreciate the codes, I will share some theoretical part of the solution. But our main focus here is to solve this problem computationally using Python and R.- The value of $bar{y}_U$ is coded as follows:
Python CodeThis file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersimport numpy as np dat = np.array([98, 102, 154, 133, 190, 175]) y_u = dat.mean() y_u # OUTPUT 142.0 This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characterslibrary(magrittr) dat <- c(98, 102, 154, 133, 190, 175) y_u <- dat %>% mean y_u #OUTPUT [1] 142 - To obtain the sample using the sample index given in the table in the above question, we do a combination of population index of three elements, ${6choose 3}$, first. Where the first two combinations are the samples, ${1,2,3}$ and ${1,2,4}$, and so on. Then from this list of all possible combinations of three elements, we draw those that are listed in the above table as our samples, with first sample index ${1,3,5}$, having population units, ${98, 154, 190}$. So that the following is the code of this sampling design:
Python CodeThis file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersimport itertools as it import pandas as pd # Combination of the Population smple = np.array(list(it.combinations([1, 2, 3, 4, 5, 6], 3))) # Index of the sample smple_idx = [5, 6, 7, 8, 11, 12, 13, 14] # The sampled data smple_dat = pd.DataFrame(dat[smple[smple_idx] - np.array([1])], columns = np.arange(3)) smple_dat # OUTPUT 0 1 2 0 98 154 190 1 98 154 175 2 98 133 190 3 98 133 175 4 102 154 190 5 102 154 175 6 102 133 190 7 102 133 175 This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characterslibrary(magrittr) # Combination of the Population smple <- c(1, 2, 3, 4, 5, 6) %>% combn(3) %>% t # Index of the sample smple_idx <- c(6, 7, 8, 9, 12, 13, 14, 15) # The sampled data smple_dat <- dat[smple[smple_idx,]] %>% matrix(ncol = 3) %>% data.frame smple_dat # OUTPUT X1 X2 X3 1 98 154 190 2 98 154 175 3 98 133 190 4 98 133 175 5 102 154 190 6 102 154 175 7 102 133 190 8 102 133 175 - Now to obtain the expected value of the average of the sample data, we compute it using $mathrm{E}bar{y}=sum_{k}bar{y}_kmathrm{P}(bar{y}_k)=sum_{k}bar{y_k}mathrm{P}(mathcal{S}_k)$, $forall kin{1,cdots,8}$. So for $k = 1$, $$ begin{aligned} bar{y}_1mathrm{P}(mathcal{S}_1)&=frac{98+154+190}{3}mathrm{P}(mathcal{S}_1)\ &=frac{98+154+190}{3}left(frac{1}{8}right)=18.41667. end{aligned} $$ Applying this to the remaining $n-1$ $k$s, and summing up the terms gives us the answer to $mathrm{E}bar{y}$. So that the following is the equivalent of this:
Python CodeThis file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersimport numpy as np # The probability of the samples ps = 1.0 / 8.0 # Define the expectation function def expectation(df, prob): ex = [None] * df.shape[0] ps = np.asarray(prob) for i in np.arange(df.shape[0]): if ps.size == 1: ex[i] = df.ix[i, :].mean() * ps else: ex[i] = df.ix[i, :].mean() * ps[i] return np.array(ex).sum() # Compute the expectation of the ybar expectation(sample_dat, ps) # OUTPUT 142.0 This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characterslibrary(magrittr) # The Probability of the Samples ps = 1.0 / 8.0 # Define the expectation function expectation <- function (df, prob) { ex <- numeric() df <- df %>% as.data.frame for (i in (df %>% dim)[1] %>% seq) { if (prob %>% length == 1) { ex[i] <- df[i, ] %>% as.numeric %>% mean * prob } else { ex[i] <- df[i, ] * prob[i] } } ex %>% sum %>% return } # Compute the expectation of the data expectation(smple_dat, ps) # OUTPUT [1] 142 - Next is to compute for the variance of $bar{y}$, which is $mathrm{Var}bar{y}=mathrm{E}bar{y}^{2}-(mathrm{E}bar{y})^2$. So we need a function for $mathrm{E}bar{y}^2$, where the first term of this, $k=1$, is $bar{y}_1^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2(frac{1}{8})=2713.3889$. Applying this to other terms and summing them up, we have following code:
Python CodeThis file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersimport numpy as np # Define the expectation function for ybar^2 def expectation2(df, prob): ex = [None] * df.shape[0] ps = np.asarray(prob) for i in np.arange(df.shape[0]): if ps.size == 1: ex[i] = (df.ix[i, :].mean() ** 2) * ps else: ex[i] = (df.ix[i, :] ** 2) * ps[i] return np.array(ex).sum() # Compute the expectation of ybar^2 expectation2(smple_dat, ps) # OUTPUT 20182.944444444445 This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characterslibrary(magrittr) # Define the expectation function for ybar^2 expectation2 <- function (df, prob) { ex <- numeric() df <- df %>% as.data.frame for (i in (df %>% dim)[1] %>% seq) { if (prob %>% length == 1) { ex[i] <- (df[i, ] %>% as.numeric %>% mean) ^ 2 * prob } else { ex[i] <- df[i, ] ^ 2 * prob[i] } } ex %>% sum %>% return } # Compute the expectation of ybar^2 expectation2(smple_dat, ps) # OUTPUT [1] 20182.94
Python Code:This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersdef variance(df, prob): return expectation2(df, prob) - expectation(df, prob) ** 2 variance(smple_dat, ps) # OUTPUT 18.944444444445253 This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersvariance <- function (df, prob) { expectation2(df, prob) - expectation(df, prob) ^ 2 %>% return } variance(smple_dat, ps) # OUTPUT [1] 18.94444
- Now to obtain the expected value of the average of the sample data, we compute it using $mathrm{E}bar{y}=sum_{k}bar{y}_kmathrm{P}(bar{y}_k)=sum_{k}bar{y_k}mathrm{P}(mathcal{S}_k)$, $forall kin{1,cdots,8}$. So for $k = 1$, $$ begin{aligned} bar{y}_1mathrm{P}(mathcal{S}_1)&=frac{98+154+190}{3}mathrm{P}(mathcal{S}_1)\ &=frac{98+154+190}{3}left(frac{1}{8}right)=18.41667. end{aligned} $$ Applying this to the remaining $n-1$ $k$s, and summing up the terms gives us the answer to $mathrm{E}bar{y}$. So that the following is the equivalent of this:
- The $mathrm{Bias}$ is just the difference between the estimate and the true value. And since the estimate is unbiased ($mathrm{E}bar{y}=142$), so $mathrm{Bias}=142-142=0$.
- $mathrm{MSE}=mathrm{Var}bar{y}-(mathrm{Bias}bar{y})^2$, and since the $mathrm{Bias}bar{y}=0$. So $mathrm{MSE}=mathrm{Var}bar{y}$.
- The value of $bar{y}_U$ is coded as follows:
- First we need to obtain the probability of each Age, that is by dividing the Number of Children with the total sum of it. That is why, we have
p_s
function defined below. After obtaining the probabilities, we can then compute the expected value using theexpectation
function we defined earlier.
Python CodeThis file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersimport pandas as pd # The data dat2 = pd.DataFrame({ "age" : [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], "nchild" : [13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1] }) # Function for computing the probability of age def p_s(df): return df / df.sum() # Compute the probabilities of the age pr = p_s(dat2.ix[:, "nchild"]) # Compute the mean expectation(dat2.ix[:, "age"], pr) # OUTPUT 12.079166666666669 This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characterslibrary(magrittr) # The data dat2 <- data.frame( "age" = c(9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20), "nchild" = c(13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1) ) # Function for computing the probability of age p_s <- function (df) { (df / (df %>% sum)) %>% return } # Compute the probabilities of the age pr = p_s(dat2[, "nchild"]) # Compute the mean expectation(dat2[, "age"], pr) # OUTPUT [1] 12.07917
Python CodeThis file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersnp.sqrt(variance(dat2.ix[:, "age"], pr)) # OUTPUT 1.9208242949895602 This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersvariance(dat2[, "age"], pr) %>% sqrt # OUTPUT [1] 1.920824 - Let me give you a brief discussion on the systematic sampling to help you understand the code. The idea in systematic sampling is that, given the population units numbered from 1 to $N$, we compute for the sampling interval, given by $k = frac{N}{n}$, where $n$ is the number of units needed for the sample. After that, we choose for the random start, number between $1$ and $k$. This random start will be the first sample, and then the second unit in the sample is obtained by adding the sampling interval to the random start, and so on. There are two types of systematic sampling namely, Linear and Circular Systematic Samplings. Circular systematic sampling treats the population units numbered from $1$ to $N$ in circular form, so that if the increment step is more than the number of $N$ units, say $N+2$, the sample unit is the $2^{nd}$ element in the population, and so on. The code that I will be sharing can be used both for linear and circular, but for this particular problem only. Since there are rules in linear that are not satisfied in the function, one of which is if $k$ is not a whole number, despite that, however, you can always extend it to a more general function.
Python CodeThis file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersimport numpy as np # The data sal_dat = np.array([25, 15, 20, 25, 18, 12, 24, 30, 15, 20, 10, 10, 11, 14, 22, 16]) salary = sal_dat * 1000 # Function for systematic sampling def sys_sample(df, r, n): k = df.shape[0] / n b = [None] * n; a = r b[0] = a for i in np.arange(1, n): a = a + k if a > df.shape[0]: a = a - df.shape[0] b[i] = a return {"Data" : df[b], "Index" : b, "K" : k} # Do the sampling for random start, # r = 2, and number of sample, n = 4 sys_sample(salary, r = 1, n = 8) # OUTPUT {'Data': array([15000, 25000, 12000, 30000, 20000, 10000, 14000, 16000]), 'Index': [1, 3, 5, 7, 9, 11, 13, 15], 'K': 2} This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characterslibrary(magrittr) # The data sal_dat <- c(25, 15, 20, 25, 18, 12, 24, 30, 15, 20, 10, 10, 11, 14, 22, 16) salary <- sal_dat * 1000 # Function for systematic sampling sys_sample <- function (df, r, n) { k <- (df %>% length) / n b <- numeric(); a <- r b[1] <- a for (i in 2:n) { a <- a + k if (a > (df %>% length)) { a <- a - (df %>% length) } b[i] <- a } return (list(Data = df[b], Index = b, K = k)) } # Do the sampling for random start, # r = 2, and number of sample, n = 4 sys_sample(salary1, r = 2, n = 8) # OUTPUT $Data [1] 15000 25000 12000 30000 20000 10000 14000 16000 $Index [1] 2 4 6 8 10 12 14 16 $K [1] 2
Reference
To leave a comment for the author, please follow the link and comment on their blog: Analysis with Programming.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.