Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
You can find this post in Blogdown format by clicking here
When it comes to analyzing survey data, you have to take into account the stochastic structure of the sample that was selected to obtain the data. Plots and graphics should not be an exception. The main aim of such studies is to try to infer about how the behavior of the outcomes of interest in the finite population.
For example, you may want to know how many people are in poverty. How about counting the poor people in the sample? I know, it is not a good idea. You have to use the sampling weights to estimate the whole number of poor people in the population. The total error paradigm follows the same approach when estimating any parameter of the finite population: you have to use the sampling weights in order to obtain unbiased estimators.
The idea behind this paradigm is the representative principle. If a person $k$ is included in the sample with a sampling weight $w_k$, she/he represents to himself and $w_k-1$ remaining people. That way, you obtain design-unbiasedness.
I am not a big fan of using descriptive sample statistics to analyze survey data because it can mask the reality, and the fact is that person $k$ is included in the sample, but he/she is not alone. Behind person $k$ there are other people, not included in the sample, and you have to realize that.
So, let’s apply that principle to scatter plots. I am using the Lucy
population from the TeachingSampling
package to recreate my idea. The following code is used to draw a $\pi PS$ sample.
library(TeachingSampling) data(Lucy) # The inclusion probability is proportional to Income # The selected sample of size n=400 n <- 400 set.seed(123) res <- S.piPS(n, Lucy$Income) sam <- res[,1] # The sample is stored in an data.sample data.sample <- Lucy[sam, ] attach(data.sample)
The sampling weights will be stored in the data.sample
data in the column wk
. They will be useful to reproduce our finite popultaion from the sample data.
# Pik.s is the inclusion probability of units in the sample data.sample$Pik.s <- res[,2] # wk is the sampling weight data.sample$wk <- 1/data.sample$Pik.s
Now, let`s make a plot of the sampling data (with just 400 observations). I recall that this scenario is somehow misleading, because we want to know the behavior of the variables in the finite population.
library(ggplot2) ggplot(data = data.sample, aes(x = Income, y = Employees)) + geom_point()
The first option that comes to mind is to include the sampling weights in the points of the scatter plot. However, this approach is not appealing to me, because it is not straightforward from this plot to visuzlize the entire finite population.
ggplot(data = data.sample, aes(x = Income, y = Employees, size = wk)) + geom_point()
In order to make the finite population scatter plot from the survey sample, I will replicate the rows of the data.sample
object as many times as the sampling weight wk
. I am using the mefa::rep
function to achieve this goal. So, the newLucy
object is an intent to mimic the finite population by using the selected sample.
library(mefa) newLucy <- NULL for(i in 1:nrow(data.sample)){ newLucy <- rbind(newLucy, rep(data.sample[i, ], round(data.sample$wk[i]))) } newLucy <- as.data.frame(newLucy)
Now, with the newLucy
population, I will make a scatter plot. Now, as I am replicating the rows of the sample data, I will add a jitter to avoid overplotting of the points in the scatter plot. This way, this plot (with 2396 observations) looks as if it would come from the finite population.
ggplot(data = newLucy, aes(x = Income, y = Employees)) + geom_point() + geom_jitter(width = 15, height = 15)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.