Python and R: Basic Sampling Problem

[This article was first published on Analysis with Programming, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.

Problems

  1. Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.

    $y_1 = 98$$y_2 = 102$$y_3=154$
    $y_4 = 133$$y_5 = 190$$y_6=175$

    We are interested in $bar{y}_U$, the population mean. Consider eight possible samples chosen.

    Sample No.Sample, $mathcal{S}$$P(mathcal{S})$
    1${1,3,5}$$1/8$
    2${1,3,6}$$1/8$
    3${1,4,5}$$1/8$
    4${1,4,6}$$1/8$
    5${2,3,5}$$1/8$
    6${2,3,6}$$1/8$
    7${2,4,5}$$1/8$
    8${2,4,6}$$1/8$

    1. What is the value of $bar{y}_U$?
    2. Let $bar{y}$ be the mean of the sample values. For each sampling plan, find
      1. $mathrm{E}bar{y}$;
      2. $mathrm{Var}bar{y}$;
      3. $mathrm{Bias}(bar{y})$;
      4. $mathrm{MSE}(bar{y})$;
  2. Mayr et al. (1994) took an SRS of 240 children who visisted their pediatric outpatient clinic. They found the following frequency distribution for the age (in months) of free (unassisted) walking among the children:

    Age (months)91011121314151617181920
    Number of Children133544693624732511

    Find the mean and SE of the age for onset of free walking.
  3. Table 1 gives the cultivated area in acres in 1981 for 40 villages in a region. (Theory and Method of Survey) Using the arrangement (random) of data in the table, draw systematic sample of size 8. Use r ((random start) = 2,

    Village$Y_j$Village$Y_j$Village$Y_j$Village$Y_j$
    11051131921703116
    262512722224932439
    347131092338433123
    431214912448234207
    5327151522537835145
    6230161892611136666
    7240173652753437338
    820318702830638624
    953592492965539501
    10275203843010240962

Solutions

In order to appreciate the codes, I will share some theoretical part of the solution. But our main focus here is to solve this problem computationally using Python and R.
    1. The value of $bar{y}_U$ is coded as follows:

      Python Code
      import numpy as np
      dat = np.array([98, 102, 154, 133, 190, 175])
      y_u = dat.mean()
      y_u
      # OUTPUT
      142.0
      view raw samp1.py hosted with ❤ by GitHub
      R Code
      library(magrittr)
      dat <- c(98, 102, 154, 133, 190, 175)
      y_u <- dat %>% mean
      y_u
      #OUTPUT
      [1] 142
      view raw samp1.R hosted with ❤ by GitHub
    2. To obtain the sample using the sample index given in the table in the above question, we do a combination of population index of three elements, ${6choose 3}$, first. Where the first two combinations are the samples, ${1,2,3}$ and ${1,2,4}$, and so on. Then from this list of all possible combinations of three elements, we draw those that are listed in the above table as our samples, with first sample index ${1,3,5}$, having population units, ${98, 154, 190}$. So that the following is the code of this sampling design:

      Python Code
      import itertools as it
      import pandas as pd
      # Combination of the Population
      smple = np.array(list(it.combinations([1, 2, 3, 4, 5, 6], 3)))
      # Index of the sample
      smple_idx = [5, 6, 7, 8, 11, 12, 13, 14]
      # The sampled data
      smple_dat = pd.DataFrame(dat[smple[smple_idx] - np.array([1])], columns = np.arange(3))
      smple_dat
      # OUTPUT
      0 1 2
      0 98 154 190
      1 98 154 175
      2 98 133 190
      3 98 133 175
      4 102 154 190
      5 102 154 175
      6 102 133 190
      7 102 133 175
      view raw samp2.py hosted with ❤ by GitHub
      R Code
      library(magrittr)
      # Combination of the Population
      smple <- c(1, 2, 3, 4, 5, 6) %>% combn(3) %>% t
      # Index of the sample
      smple_idx <- c(6, 7, 8, 9, 12, 13, 14, 15)
      # The sampled data
      smple_dat <- dat[smple[smple_idx,]] %>% matrix(ncol = 3) %>% data.frame
      smple_dat
      # OUTPUT
      X1 X2 X3
      1 98 154 190
      2 98 154 175
      3 98 133 190
      4 98 133 175
      5 102 154 190
      6 102 154 175
      7 102 133 190
      8 102 133 175
      view raw samp2.R hosted with ❤ by GitHub
      1. Now to obtain the expected value of the average of the sample data, we compute it using $mathrm{E}bar{y}=sum_{k}bar{y}_kmathrm{P}(bar{y}_k)=sum_{k}bar{y_k}mathrm{P}(mathcal{S}_k)$, $forall kin{1,cdots,8}$. So for $k = 1$, $$ begin{aligned} bar{y}_1mathrm{P}(mathcal{S}_1)&=frac{98+154+190}{3}mathrm{P}(mathcal{S}_1)\ &=frac{98+154+190}{3}left(frac{1}{8}right)=18.41667. end{aligned} $$ Applying this to the remaining $n-1$ $k$s, and summing up the terms gives us the answer to $mathrm{E}bar{y}$. So that the following is the equivalent of this:

        Python Code
        import numpy as np
        # The probability of the samples
        ps = 1.0 / 8.0
        # Define the expectation function
        def expectation(df, prob):
        ex = [None] * df.shape[0]
        ps = np.asarray(prob)
        for i in np.arange(df.shape[0]):
        if ps.size == 1:
        ex[i] = df.ix[i, :].mean() * ps
        else:
        ex[i] = df.ix[i, :].mean() * ps[i]
        return np.array(ex).sum()
        # Compute the expectation of the ybar
        expectation(sample_dat, ps)
        # OUTPUT
        142.0
        view raw samp3.py hosted with ❤ by GitHub
        R Code
        library(magrittr)
        # The Probability of the Samples
        ps = 1.0 / 8.0
        # Define the expectation function
        expectation <- function (df, prob) {
        ex <- numeric()
        df <- df %>% as.data.frame
        for (i in (df %>% dim)[1] %>% seq) {
        if (prob %>% length == 1) {
        ex[i] <- df[i, ] %>% as.numeric %>% mean * prob
        } else {
        ex[i] <- df[i, ] * prob[i]
        }
        }
        ex %>% sum %>% return
        }
        # Compute the expectation of the data
        expectation(smple_dat, ps)
        # OUTPUT
        [1] 142
        view raw samp3.R hosted with ❤ by GitHub
        From the above code, the output tells us that $mathrm{E}bar{y}=140$.
      2. Next is to compute for the variance of $bar{y}$, which is $mathrm{Var}bar{y}=mathrm{E}bar{y}^{2}-(mathrm{E}bar{y})^2$. So we need a function for $mathrm{E}bar{y}^2$, where the first term of this, $k=1$, is $bar{y}_1^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2(frac{1}{8})=2713.3889$. Applying this to other terms and summing them up, we have following code:

        Python Code
        import numpy as np
        # Define the expectation function for ybar^2
        def expectation2(df, prob):
        ex = [None] * df.shape[0]
        ps = np.asarray(prob)
        for i in np.arange(df.shape[0]):
        if ps.size == 1:
        ex[i] = (df.ix[i, :].mean() ** 2) * ps
        else:
        ex[i] = (df.ix[i, :] ** 2) * ps[i]
        return np.array(ex).sum()
        # Compute the expectation of ybar^2
        expectation2(smple_dat, ps)
        # OUTPUT
        20182.944444444445
        view raw samp4.py hosted with ❤ by GitHub
        R Code
        library(magrittr)
        # Define the expectation function for ybar^2
        expectation2 <- function (df, prob) {
        ex <- numeric()
        df <- df %>% as.data.frame
        for (i in (df %>% dim)[1] %>% seq) {
        if (prob %>% length == 1) {
        ex[i] <- (df[i, ] %>% as.numeric %>% mean) ^ 2 * prob
        } else {
        ex[i] <- df[i, ] ^ 2 * prob[i]
        }
        }
        ex %>% sum %>% return
        }
        # Compute the expectation of ybar^2
        expectation2(smple_dat, ps)
        # OUTPUT
        [1] 20182.94
        view raw samp4.R hosted with ❤ by GitHub
        So that using the above output, 20182.94, and subtracting $(mathrm{E}bar{y})^2$ to it, will give us the variance. And hence the succeeding code:

        Python Code:
        def variance(df, prob):
        return expectation2(df, prob) - expectation(df, prob) ** 2
        variance(smple_dat, ps)
        # OUTPUT
        18.944444444445253
        view raw samp5.py hosted with ❤ by GitHub
        R Code:
        variance <- function (df, prob) {
        expectation2(df, prob) - expectation(df, prob) ^ 2 %>% return
        }
        variance(smple_dat, ps)
        # OUTPUT
        [1] 18.94444
        view raw samp5.R hosted with ❤ by GitHub
        So the variance of the $bar{y}$ is $18.9444$.
    3. The $mathrm{Bias}$ is just the difference between the estimate and the true value. And since the estimate is unbiased ($mathrm{E}bar{y}=142$), so $mathrm{Bias}=142-142=0$.
    4. $mathrm{MSE}=mathrm{Var}bar{y}-(mathrm{Bias}bar{y})^2$, and since the $mathrm{Bias}bar{y}=0$. So $mathrm{MSE}=mathrm{Var}bar{y}$.
  1. First we need to obtain the probability of each Age, that is by dividing the Number of Children with the total sum of it. That is why, we have p_s function defined below. After obtaining the probabilities, we can then compute the expected value using the expectation function we defined earlier.

    Python Code
    import pandas as pd
    # The data
    dat2 = pd.DataFrame({
    "age" : [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    "nchild" : [13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1]
    })
    # Function for computing the probability of age
    def p_s(df):
    return df / df.sum()
    # Compute the probabilities of the age
    pr = p_s(dat2.ix[:, "nchild"])
    # Compute the mean
    expectation(dat2.ix[:, "age"], pr)
    # OUTPUT
    12.079166666666669
    view raw samp6.py hosted with ❤ by GitHub
    R Code
    library(magrittr)
    # The data
    dat2 <- data.frame(
    "age" = c(9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
    "nchild" = c(13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1)
    )
    # Function for computing the probability of age
    p_s <- function (df) {
    (df / (df %>% sum)) %>% return
    }
    # Compute the probabilities of the age
    pr = p_s(dat2[, "nchild"])
    # Compute the mean
    expectation(dat2[, "age"], pr)
    # OUTPUT
    [1] 12.07917
    view raw samp6.R hosted with ❤ by GitHub
    It should be clear in the data that the average age is about 12 months old, where the plot of it is shown below, For the code of the above plot please click here. Next is to compute the standard error which is just the square root of the variance of the sample,

    Python Code
    np.sqrt(variance(dat2.ix[:, "age"], pr))
    # OUTPUT
    1.9208242949895602
    view raw samp7.py hosted with ❤ by GitHub
    R Code
    variance(dat2[, "age"], pr) %>% sqrt
    # OUTPUT
    [1] 1.920824
    view raw samp7.R hosted with ❤ by GitHub
    So the standard variability of the Age is 1.920824.
  2. Let me give you a brief discussion on the systematic sampling to help you understand the code. The idea in systematic sampling is that, given the population units numbered from 1 to $N$, we compute for the sampling interval, given by $k = frac{N}{n}$, where $n$ is the number of units needed for the sample. After that, we choose for the random start, number between $1$ and $k$. This random start will be the first sample, and then the second unit in the sample is obtained by adding the sampling interval to the random start, and so on. There are two types of systematic sampling namely, Linear and Circular Systematic Samplings. Circular systematic sampling treats the population units numbered from $1$ to $N$ in circular form, so that if the increment step is more than the number of $N$ units, say $N+2$, the sample unit is the $2^{nd}$ element in the population, and so on. The code that I will be sharing can be used both for linear and circular, but for this particular problem only. Since there are rules in linear that are not satisfied in the function, one of which is if $k$ is not a whole number, despite that, however, you can always extend it to a more general function.

    Python Code
    import numpy as np
    # The data
    sal_dat = np.array([25, 15, 20, 25, 18, 12, 24, 30, 15, 20, 10, 10, 11, 14, 22, 16])
    salary = sal_dat * 1000
    # Function for systematic sampling
    def sys_sample(df, r, n):
    k = df.shape[0] / n
    b = [None] * n; a = r
    b[0] = a
    for i in np.arange(1, n):
    a = a + k
    if a > df.shape[0]:
    a = a - df.shape[0]
    b[i] = a
    return {"Data" : df[b], "Index" : b, "K" : k}
    # Do the sampling for random start,
    # r = 2, and number of sample, n = 4
    sys_sample(salary, r = 1, n = 8)
    # OUTPUT
    {'Data': array([15000, 25000, 12000, 30000, 20000, 10000, 14000, 16000]),
    'Index': [1, 3, 5, 7, 9, 11, 13, 15],
    'K': 2}
    view raw samp8.py hosted with ❤ by GitHub
    R Code
    library(magrittr)
    # The data
    sal_dat <- c(25, 15, 20, 25, 18, 12, 24, 30, 15, 20, 10, 10, 11, 14, 22, 16)
    salary <- sal_dat * 1000
    # Function for systematic sampling
    sys_sample <- function (df, r, n) {
    k <- (df %>% length) / n
    b <- numeric(); a <- r
    b[1] <- a
    for (i in 2:n) {
    a <- a + k
    if (a > (df %>% length)) {
    a <- a - (df %>% length)
    }
    b[i] <- a
    }
    return (list(Data = df[b], Index = b, K = k))
    }
    # Do the sampling for random start,
    # r = 2, and number of sample, n = 4
    sys_sample(salary1, r = 2, n = 8)
    # OUTPUT
    $Data
    [1] 15000 25000 12000 30000 20000 10000 14000 16000
    $Index
    [1] 2 4 6 8 10 12 14 16
    $K
    [1] 2
    view raw samp8.R hosted with ❤ by GitHub
    You may notice in the output above, that the index returned in Python is not the same with the index returned in R. This is because Python index starts with 0, while that in R starts with 1. So that’s why we have the same population units sampled between the two language despite the differences between the index returned.

Reference

  1. Lohr, Sharon (2009). Sampling: Design and Analysis. Cengage Learning.

To leave a comment for the author, please follow the link and comment on their blog: Analysis with Programming.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)