Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Small area estimation (SAE) has become a widely used technique in official statistics since the last decade of past century. When the sample size is not enough to provide reliable estimates at a very particular level, the power of models and auxiliary information must be applied with no hesitation. In a nutshell, SAE tries to exploits similarity and borrows strength from available information.
I will write some posts to present, step by step, the fundamentals of SAE and how it can be implemented in the R software. The first post (this one you are reading now) is about basic concepts such as sampling and databases, the second and third posts will deal with direct and indirect estimates, respectively; the fourth post will introduce model-assisted estimation; and finally, the fifth post will deal with the Fay-Harriot method.
Let’s begin. First of all, The Australian Bureau of Statistics declares that small area estimation refers to methods of producing sufficiently reliable estimates for geographic areas that are too fine to obtain with precision, using direct survey estimation methods. By direct estimation, we mean classical design-based survey estimation methods that utilize only the sample units contained in each small area. Small area estimation methods are used to overcome the problem of small samples sizes to produce small area estimates that improve the quality of direct survey estimates obtained from the sample in each small area. The more sophisticated of these methods work by taking advantage of various relationships in the data, and involve, either implicitly or explicitly, a statistical model to describe these relationships.
Now, I want to reproduce a clarifying explanation from Dr. Little (paraphrasing Groves) about SAE, and its use in survey sampling. They claim that regression estimates provide relatively precise predictions for small areas from a survey that account for the differences between areas of characteristics included as predictors in the survey but do not account for differences in characteristics not included in the study. Direct estimates for each area are unique to the area and hence take into account both observed and unobserved relevant characteristics; however, they have low precision in areas where the sample size is small. The SAE model combines the regression estimate and direct estimate for each area in a sensible way, balancing bias and precision.
For this technique to succeeds, Longford 2005 claims that areas that are known to be similar to one another should be receiving similar estimates, rather than estimates independent of one another. The degree to which similarity can or should be imposed can be chosen from statistical grounds by minimizing the overall discrepancy (mean squared error).
Rahman 2008 emphasizes that SAE uses data from similar domains to estimate the statistics in a particular small area of interest, and this ‘borrowing of strength’ is justified by assuming a model which relates the small area statistics. SAE is the process of using statistical models to link survey outcome or response variables to a set of predictor variables known for small areas to predict small area-level estimates. Traditional area-specific estimates may not provide enough statistical precision because of small sample observations in small geographical regions. In such situation, it may be worth checking whether it is possible to use indirect estimation approaches based on the linking models.
R workshop
This code will not produce any small area estimation, but it will help to introduce basic concepts of sampling. We will use the BigLucy database from the TeachingSampling package to illustrate how to obtain a probabilistic sample from a finite population. This database is about deals with some economic variables for a population of 85296 companies spread out into 100 counties (areas) in a particular year of some fake country. The aim of the exercise is 1) to select a stratified sample according to the size of the companies and 2) obtain accurate estimates of the total income within each of the 100 counties. Note that the parameters of interest are given by:
$$t_{y,d} = \sum_{k \in U_d} y_k$$
Where $t_{y,d}$ denotes the total income of the $d$-th county, $y_k$ is the income of the $k$-th company belonging the $d$-th county. The whole population of companies into the county is noted by $U_d$. Thus, this code computes the total income for each county along with the number of companies belonging each county. Finally, this parameters are saved in a new database named Results.
############################# ##### Setting things up ##### ############################# setwd(“/wherever your prefer location is") rm(list = ls()) set.seed(2017) library(TeachingSampling) library(dplyr) data("BigLucy") summary(BigLucy$Level) levels(BigLucy$Zone) Total <- BigLucy %>% group_by(Zone) %>% summarise(Income. = sum(Income)) %>% arrange(Zone) N <- BigLucy %>% group_by(Zone) %>% summarise(N.county = n()) %>% arrange(Zone) Results <- data.frame(N, Total$Income.) #Checking the population total (Total <- sum(Results$Total.Income.))
Now, once the parameters of the finite populations are computed, the time has come for us to select a sample. The sampling design we will use is an stratified sampling by taking advantage of the classification of each company into the database according to its level: small, medium and large. The selected sample will be stored into an object named data.sample and the sampling weights will be saved in another object called FEX.
####################################### ##### Drawing a stratified sample ##### ####################################### # Level is the stratifying variable summary(BigLucy$Level) attach(BigLucy) # Defines the size of each stratum N1 <- summary(Level)[[1]] N2 <- summary(Level)[[2]] N3 <- summary(Level)[[3]] N1;N2;N3 Nh <- c(N1,N2,N3) # Defines the sample size at each stratum n1 <- round(N1 * 0.05) n2 <- round(N2 * 0.05) n3 <- round(N3 * 0.05) (nh<-c(n1,n2,n3)) # Draws a stratified sample sam <- S.STSI(Level, Nh, nh) data.sam <- BigLucy[sam,] data.sam$FEX <- NULL data.sam$FEX[data.sam$Level == "Big"] <- Nh[1] / nh[1] data.sam$FEX[data.sam$Level == "Medium"] <- Nh[2] / nh[2] data.sam$FEX[data.sam$Level == "Small"] <- Nh[3] / nh[3] dim(data.sam) save(data.sam, file = "data.sam.RData")
In summary, we have drawn a sample of 4265 companies: 145 big companies, 1290 medium companies and, 2830 small companies. Each of these companies is spread out through the 100 counties in the country. In the Results database will be stored the information about those counties (how many companies are in each county and, how many companies are in the sample for each county.) along with different estimations for the total income of the counties.
###################### ##### In summary ##### ###################### n <- data.sam %>% group_by(Zone) %>% summarise(n.county = n()) %>% arrange(Zone) Results$n.county <- n$n.county Results <- Results[c("Zone", "N.county", "n.county", "Total.Income.")] save(Results, file = "Results.RData")
In the following post, we will use this sample to estimate the total income for each of the hundred zones in BigLucy’s population.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.