by John Mount, Ph.D.
Data Scientist at Win-Vector LLC
In part 2 of her series on Principal Components Regression, Dr. Nina Zumel illustrates so-called y-aware techniques. These often-neglected methods exploit the fact that in predictive modeling problems we know the dependent variable (the outcome, or y), so we can use it during data preparation as well as during modeling. Dr. Zumel shows that incorporating y-aware preparation into Principal Components Analysis can capture more of the problem structure in fewer variables. Such methods include:
- Effects-based variable pruning
- Significance-based variable pruning
- Effects-based variable scaling
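As a rough illustration of the last item, here is a minimal Python sketch of effects-based (y-aware) scaling. It assumes the scaling rule is "multiply each centered input by the slope of a one-variable linear regression of y on that input", so that a unit change in the rescaled input corresponds to a unit change in y on average. The function name `y_aware_scale`, the synthetic data, and the column names are all hypothetical and are not taken from Dr. Zumel's article.

```python
import random
from statistics import fmean, pstdev


def y_aware_scale(x, y):
    """Center x and multiply it by the slope of the one-variable
    least-squares fit of y on x (assumed scaling rule, see lead-in)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0  # constant columns collapse to 0
    return [slope * (xi - x_bar) for xi in x]


# Synthetic data: one informative input and one large-variance noise input.
random.seed(0)
n = 500
signal = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 10) for _ in range(n)]  # unrelated to y
y = [2 * s + random.gauss(0, 0.5) for s in signal]

signal_scaled = y_aware_scale(signal, y)
noise_scaled = y_aware_scale(noise, y)

# Before scaling the noise column has the larger spread, so an x-only
# PCA would favor it; after y-aware scaling the ordering is reversed.
print("raw spread    (signal, noise):", pstdev(signal), pstdev(noise))
print("scaled spread (signal, noise):", pstdev(signal_scaled), pstdev(noise_scaled))
```

The point of the sketch is only the change in relative spread: a variance-driven method such as PCA, run on the rescaled columns, would weight the inputs by their apparent effect on y rather than by their raw units.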
Y-aware preparation recovers more domain structure and leads to better models. Building on the foundation laid in the first article, Dr. Zumel quickly shows how to move from a traditional x-only analysis, which fails to preserve a domain-specific relation of two variables to the outcome, to a y-aware analysis that preserves the relation. In other words, she shows how to move away from a middling result in which different values of y (rendered as three colors) are hopelessly intermingled when plotted against the first two latent variables found, as shown below.
Dr. Zumel then shows how to perform a decisive analysis in which y is somewhat sortable by each of the first two latent variables, and in which those two latent variables capture complementary effects, making them good joint candidates for further modeling (as shown below).