useR! 2016 Tutorials: Part 2
by Joseph Rickert
Last week, I mentioned a few of the useR tutorials that I had the opportunity to attend. Here are the links to the slides and code for all but two of the tutorials:
Regression Modeling Strategies and the rms Package – Frank Harrell
Using Git and GitHub with R, RStudio, and R Markdown – Jennifer Bryan
Effective Shiny Programming – Joe Cheng
Missing Value Imputation with R – Julie Josse
Extracting data from the web: APIs and beyond – Ram, Grolemund & Chamberlain
Ninja Moves with data.table – Learn by Doing in a Cookbook Style Workshop – Matt Dowle
Never Tell Me the Odds! Machine Learning with Class Imbalances – Max Kuhn
MoRe than woRds, Text and Context: Language Analytics in Finance with R – Das & Mokashi
Time-to-Event Modeling as the Foundation of Multi-Channel Revenue Attribution – Tess Calvez
Handling and Analyzing Spatial, Spatiotemporal and Movement Data – Edzer Pebesma
Machine Learning Algorithmic Deep Dive – Erin LeDell
Introduction to SparkR – Venkataraman & Falaki
Using R with Jupyter Notebooks for Reproducible Research – de Vries & Harris
Understanding and Creating Interactive Graphics Part 1, Part 2 – Hocking & Ekstrom
Genome-Wide Association Analysis and Post-Analytic Interrogation Part 1, Part 2 – Foulkes
An Introduction to Bayesian Inference using R Interfaces to Stan – Ben Goodrich
Small Area Estimation with R – Virgilio Gómez Rubio
Dynamic Documents with R Markdown – Yihui Xie
Granted, since the tutorials were not videotaped, they mostly fall into the category of a “you had to be there” experience. However, many of the presenters put significant effort into preparing their materials, and collectively they comprise a rich resource that is worth a good look. Here are just a couple of examples of what is to be found.
The first comes from Julie Josse's Missing Data tutorial, where a version of the ozone data set with missing values is used to illustrate a basic principle of exploratory data analysis: visualize your data and look for missing values. If there are missing values, try to determine whether there are any patterns in their location.
         maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94
The first two plots, made with the aggr() function in the VIM package, show the proportion of missing values for each variable and the relationships among missing values across all of the variables.
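As a rough sketch (this is not the tutorial's exact code, and the file name is illustrative), the aggregation plots can be produced along these lines once the ozone data with missing values has been read into a data frame:

library(VIM)

# Read the version of the ozone data that contains missing values
# (the file name here is illustrative)
ozone <- read.csv("ozoneNA.csv", row.names = 1)

# Left panel: proportion of missing values in each variable;
# right panel: combinations of variables that are missing together
aggr(ozone, col = c("navyblue", "red"), numbers = TRUE, sortVars = TRUE)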
The next plot is a scatter plot of two variables, with boxplots along the margins showing the distributions of missing values for each variable. (Here blue represents data that are present and red the missing values.) The code for this and for many more advanced analyses is included on the tutorial page.
It looks as though the missing values are spread fairly evenly throughout the data.
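The scatter plot with marginal boxplots comes from VIM's marginplot() function. A minimal sketch, assuming the same ozone data frame and using two variables picked only for illustration:

# Scatter plot of T9 against maxO3, with marginal boxplots comparing
# the distributions of observed and missing values for each variable
marginplot(ozone[, c("T9", "maxO3")])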
Frank Harrell's tutorial provides a modern look at regression analysis from a statistician's point of view. The following plot comes from the section of his tutorial on Modeling and Testing Complex Interactions. If you haven't paid much attention to the theory behind interpreting linear models in a while, you may find this interesting.
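To give a flavor of the rms workflow, here is a minimal sketch (not Harrell's code; the data frame and variable names are hypothetical) of fitting a model with a restricted cubic spline interacted with a grouping variable and then testing its components:

library(rms)

# Hypothetical data frame d with outcome y, continuous age, and factor sex
dd <- datadist(d)
options(datadist = "dd")

# Restricted cubic spline in age (4 knots) interacted with sex
fit <- ols(y ~ rcs(age, 4) * sex, data = d)

# Wald tests for overall effects, nonlinearity, and the interaction
anova(fit)

# Plot the estimated age effect separately for each sex
plot(Predict(fit, age, sex))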
Finally, I had one of those “Aha” moments right at the beginning of Ben Goodrich's presentation on Bayesian modeling. MCMC methods work by simulating draws from a Markov chain whose limiting distribution is the distribution of interest. The technique works best when the simulated draws are able to explore the entire space of the target distribution. In the following figure, the target is the bivariate normal distribution on the far right. Neither the Metropolis nor the Gibbs sampling algorithm comes close to sampling from the entire target distribution space, but the Hamiltonian Monte Carlo “NUTS” algorithm used by Stan displays very good coverage.
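As a minimal sketch of running NUTS from R (this is not Goodrich's example; the toy model and its 0.9 correlation are made up for illustration), a highly correlated bivariate normal can be sampled with rstan like this:

library(rstan)

# Toy Stan program: a bivariate normal with correlation 0.9
model_code <- "
parameters {
  vector[2] theta;
}
model {
  matrix[2, 2] Sigma;
  Sigma[1, 1] = 1;   Sigma[1, 2] = 0.9;
  Sigma[2, 1] = 0.9; Sigma[2, 2] = 1;
  theta ~ multi_normal(rep_vector(0, 2), Sigma);
}
"

# Four chains of the NUTS sampler; the pairs plot shows how well
# the draws cover the target distribution
fit <- stan(model_code = model_code, chains = 4, iter = 2000)
pairs(fit, pars = "theta")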
For reasons I described last week, I believe that this year's useR tutorial speakers have raised the bar on both content and presentation. I am going to do my best to work through these before attending next year's conference in Brussels.