In my previous post, Advanced Survey Design and Application to Big Data, I mentioned that unsupervised learning can be used to generate a stratification variable. In this post I want to elaborate on that point and show how survey design and unsupervised learning can work together to improve estimates and the training data for predictive models.
SRS and stratified samples
Consider the estimators of the total from an SRS and a stratified sample,

$$\hat{T}_{srs} = \frac{N}{n}\sum_{i \in s} y_i \qquad \text{and} \qquad \hat{T}_{st} = \sum_{h=1}^{H} \frac{N_h}{n_h}\sum_{i \in s_h} y_{hi}.$$

The variances of these estimators are given by

$$\text{var}\left(\hat{T}_{srs}\right) = N^2\left(1 - \frac{n}{N}\right)\frac{S^2}{n} \qquad \text{and} \qquad \text{var}\left(\hat{T}_{st}\right) = \sum_{h=1}^{H} N_h^2\left(1 - \frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}.$$

The variance of the stratified estimator is driven by the within-strata variances $S_h^2$, and the total sum of squares splits into two components, the within- and between-strata sums of squares,

$$(N-1)S^2 = \sum_{h=1}^{H}(N_h - 1)S_h^2 + \sum_{h=1}^{H} N_h\left(\bar{Y}_h - \bar{Y}\right)^2.$$

With some algebra it can be shown that, under proportional allocation ($n_h = n N_h / N$),

$$\text{var}\left(\hat{T}_{srs}\right) - \text{var}\left(\hat{T}_{st}\right) \approx \frac{N}{n}\left(1 - \frac{n}{N}\right)\sum_{h=1}^{H} N_h\left(\bar{Y}_h - \bar{Y}\right)^2 \geq 0.$$

This result shows that the stratified estimator is never worse than the SRS estimator, and the gain in precision is proportional to the between-strata sum of squares: the more the strata means differ, or equivalently the more homogeneous the units are within each stratum, the larger the reduction in variance.
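To make this concrete, here is a small R sketch of the calculation; the population, strata means, and sample size below are invented purely for illustration.

```r
set.seed(2020)

# simulated population: three strata with different means
N_h <- c(4000, 3500, 2500)
y <- unlist(mapply(function(n, m) rnorm(n, mean = m, sd = 2), N_h, c(10, 15, 22)))
strata <- rep(seq_along(N_h), times = N_h)

N <- sum(N_h)
n <- 600
n_h <- round(n * N_h / N)   # proportional allocation

# theoretical variance of the estimated total under SRS
S2 <- var(y)
var_srs <- N^2 * (1 - n / N) * S2 / n

# theoretical variance of the estimated total under stratified sampling
S2_h <- tapply(y, strata, var)
var_st <- sum(N_h^2 * (1 - n_h / N_h) * S2_h / n_h)

# approximate gain: the between-strata sum of squares term
ybar_h <- tapply(y, strata, mean)
gain <- (N / n) * (1 - n / N) * sum(N_h * (ybar_h - mean(y))^2)

c(var_srs = var_srs, var_stratified = var_st,
  difference = var_srs - var_st, approx_gain = gain)
```

The difference between the two variances lines up with the between-strata term, which is the quantity we want to make as large as possible when building strata.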
Unsupervised Learning
Unsupervised learning attempts to uncover hidden structure in the observed data by sorting the observations into a chosen number of clusters. The simplest algorithm to do this is k-means. The k-means algorithm is as follows:
1. Choose $k$, the number of clusters
2. Choose $k$ random points and assign them as the initial centers
3. Compute the distance between each point and each center
4. Assign each observation to the center it is closest to
5. Compute the new centers given the cluster allocation, $c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$, where $C_j$ contains the points allocated to cluster $j$
6. Compute the between and within sum of squares
7. Repeat steps 3-6 until the clusters do not change, a specified tolerance is met, or the maximum number of iterations is reached
The algorithm minimises the within-cluster sum of squares and, because the total sum of squares is fixed, simultaneously maximises the between-cluster sum of squares.
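In R this is available through the base stats function kmeans(). A quick sketch, using the iris measurements simply as convenient example data and an arbitrary choice of three clusters:

```r
# k-means on example numeric data; k chosen arbitrarily for illustration
x <- scale(iris[, 1:4])          # scale so no variable dominates the distance
km <- kmeans(x, centers = 3, nstart = 25)

# cluster allocation, which can serve as a stratification variable
head(km$cluster)

# the quantities the algorithm trades off
km$tot.withinss            # total within-cluster sum of squares (minimised)
km$betweenss               # between-cluster sum of squares (maximised as a consequence)
km$betweenss / km$totss    # proportion of total variation explained by the clustering
```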
As we saw from the formula above, the estimator under a stratified sample performs better than an SRS when the between-strata sum of squares is large, i.e. when the strata are internally homogeneous and their means are well separated. From here it’s easy to see that if we construct a stratification variable which aims to minimise the within-strata sum of squares, and hence maximise the between-strata sum of squares, we will improve the precision of our estimates. That is exactly the objective k-means optimises, so the cluster labels can be used directly as strata.
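Putting the pieces together, below is a rough end-to-end sketch under invented assumptions (a simulated population, two auxiliary variables, five clusters, and a sample of 500): cluster the population on the auxiliary variables, treat the cluster labels as strata, and compare the spread of the stratified estimate of the total against SRS by simulation.

```r
set.seed(2020)

# simulated population: y is related to two auxiliary variables x1 and x2
N <- 10000
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 5 + 3 * x1 - 2 * x2 + rnorm(N)

# build the stratification variable with k-means on the auxiliary variables
k <- 5
strata <- kmeans(scale(cbind(x1, x2)), centers = k, nstart = 25)$cluster
N_h <- as.vector(table(strata))

n <- 500
n_h <- pmax(2, round(n * N_h / N))   # proportional allocation, at least 2 per stratum

# repeated SRS estimates of the total
est_srs <- replicate(2000, {
  s <- sample(N, n)
  N * mean(y[s])
})

# repeated stratified estimates of the total using the k-means clusters as strata
est_st <- replicate(2000, {
  s <- unlist(lapply(seq_len(k), function(h) sample(which(strata == h), n_h[h])))
  sum(tapply(y[s], strata[s], mean) * N_h)
})

# the stratified estimator should show the smaller standard error
c(se_srs = sd(est_srs), se_stratified = sd(est_st), true_total = sum(y))
```

The same cluster labels can then be carried over when selecting records for a training set, so the sampled data covers the covariate space more evenly than a simple random draw.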