useR! 2016 Tutorials: Part 2

Joseph Rickert

6 years ago

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

Last week, I mentioned a few of the useR tutorials that I had the opportunity to attend. Here are the links to the slides and code for all but two of the tutorials:

Regression Modeling Strategies and the rms Package – Frank Harrell
Using Git and GitHub with R, RStudio, and R Markdown – Jennifer Bryan
Effective Shiny Programming – Joe Cheng
Missing Value Imputation with R – Julie Josse
Extracting data from the web APIs and beyond – Ram, Grolemund & Chamberlain
Ninja Moves with data.table – Learn by Doing in a Cookbook Style Workshop – Matt Dowle
Never Tell Me the Odds! Machine Learning with Class Imbalances – Max Kuhn
MoRe than woRds, Text and Context: Language Analytics in Finance with R – Das & Mokashi
Time-to-Event Modeling as the Foundation of Multi-Channel Revenue Attribution – Tess Calvez
Handling and Analyzing Spatial, Spatiotemporal and Movement Data – Edzer Pebesma
Machine Learning Algorithmic Deep Dive – Erin LeDell
Introduction to SparkR – Venkataraman & Falaki
Using R with Jupyter Notebooks for Reproducible Research – de Vries & Harris
Understanding and Creating Interactive Graphics Part 1, Part 2 – Hocking & Ekstrom
Genome-Wide Association Analysis and Post-Analytic Interrogation Part 1, Part 2 – Foulkes
An Introduction to Bayesian Inference using R Interfaces to Stan – Ben Goodrich
Small Area Estimation with R – Virgilio Gómez Rubio
Dynamic Documents with R Markdown – Yihui Xie

Granted, since the tutorials were not videotaped, they mostly fall into the category of a "you had to be there" experience. However, many of the presenters put significant effort into preparing their talks, and collectively the materials comprise a rich resource that is worth a good look. Here are just a couple of examples of what is to be found.

The first comes from Julie Josse's Missing Data tutorial, where a version of the ozone data set with missing values is used to illustrate a basic principle of exploratory data analysis: visualize your data and look for missing values. If there are missing values, try to determine whether there are any patterns in their location.

        maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94

The first two plots, made with the aggr() function in the VIM package, show the proportion of missing values for each variable and the relationship of missing values among all of the variables.
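As a rough illustration of what those plots summarize, here is a minimal sketch that rebuilds only the six rows printed above (not the full ozone data set used in the tutorial) and computes the two quantities in base R. The object names ozone_head, na_prop and na_pattern are my own, and the aggr() arguments shown are just one reasonable choice, not necessarily the ones used in the tutorial.

```r
# The six rows printed above, typed in by hand (NOT the full ozone data set).
ozone_head <- data.frame(
  maxO3  = c(87, 82, 92, 114, 94, 80),
  T9     = c(15.6, NA, 15.3, 16.2, NA, 17.7),
  T12    = c(18.5, NA, 17.6, 19.7, 20.5, 19.8),
  T15    = c(NA, NA, 19.5, NA, 20.4, 18.3),
  Ne9    = c(4, 5, 2, 1, NA, 6),
  Ne12   = c(4, 5, NA, 1, NA, NA),
  Ne15   = c(8, 7, NA, 0, NA, 7),
  Vx9    = c(0.6946, -4.3301, 2.9544, NA, -0.5000, -5.6382),
  Vx12   = c(-1.7101, -4.0000, NA, 0.3473, -2.9544, -5.0000),
  Vx15   = c(-0.6946, -3.0000, 0.5209, -0.1736, -4.3301, -6.0000),
  maxO3v = c(84, 87, 82, 92, 114, 94),
  row.names = c("20010601", "20010602", "20010603",
                "20010604", "20010605", "20010606")
)

# Proportion of missing values for each variable.
na_prop <- colMeans(is.na(ozone_head))

# Missingness pattern of each row, coded as a string of 0s (present) and
# 1s (missing), and how often each pattern occurs.
na_pattern <- table(apply(is.na(ozone_head), 1,
                          function(r) paste(as.integer(r), collapse = "")))

print(round(na_prop, 2))
print(na_pattern)

# If VIM is installed, aggr() draws both summaries as plots.
if (requireNamespace("VIM", quietly = TRUE)) {
  VIM::aggr(ozone_head, prop = TRUE, numbers = TRUE)
}
```

With only six rows every row has its own pattern, so this excerpt says little about the structure of the full data set; the point is only to show what is being counted.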

The next plot shows a scatter plot of two variables, with boxplots along the margins that show the distributions of missing values for each variable. (Here blue represents data that are present and red the missing values.) The code to do this and many more advanced analyses is included on the tutorial page.

It looks like the missing values are spread fairly evenly throughout the data.
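The idea behind the margin plot can be sketched in a few lines: compare the distribution of one variable according to whether the other is missing. The snippet below uses only the T9 and maxO3 values from the six rows printed earlier. Choosing this pair of variables is my assumption, not necessarily the pair plotted in the tutorial, and the object names are hypothetical.

```r
# T9 and maxO3 from the six rows printed earlier (an assumed pair of variables).
oz <- data.frame(
  T9    = c(15.6, NA, 15.3, 16.2, NA, 17.7),
  maxO3 = c(87, 82, 92, 114, 94, 80)
)

# Compare maxO3 for rows where T9 is missing against rows where it is observed.
t9_missing <- is.na(oz$T9)
maxO3_when_T9_missing  <- mean(oz$maxO3[t9_missing])
maxO3_when_T9_observed <- mean(oz$maxO3[!t9_missing])

print(c(missing = maxO3_when_T9_missing, observed = maxO3_when_T9_observed))

# If VIM is installed, marginplot() draws the scatter plot with the
# marginal boxplots split by missingness.
if (requireNamespace("VIM", quietly = TRUE)) {
  VIM::marginplot(oz[, c("T9", "maxO3")])
}
```

If the two groups look alike on the full data, that is consistent with the missing values not depending on the other variable; six rows are far too few to judge.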

Frank Harrell's tutorial provides a modern look at regression analysis from a statistician's point of view. The following plot comes from the section of his tutorial on Modeling and Testing Complex Interactions. If you haven't paid much attention to the theory behind interpreting linear models in a while, you may find this interesting.

Finally, I had one of those "Aha" moments right at the beginning of Ben Goodrich's presentation on Bayesian modeling. MCMC methods work by simulating draws from a Markov chain whose limiting distribution is the distribution of interest. This technique works best when the simulated draws are able to explore the entire space of the target distribution. In the following figure, the target is the bivariate normal distribution on the far right. Neither the Metropolis nor the Gibbs sampling algorithm comes close to sampling from the entire target distribution space, but the Hamiltonian Monte Carlo "NUTS" algorithm in Stan displays very good coverage.
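To make the mechanics concrete, here is a minimal sketch of a random-walk Metropolis sampler for a correlated bivariate normal target, written in base R. It is not the code behind the figure and it is not Stan's NUTS algorithm. The correlation rho, the step size, the number of draws and the function names are all assumptions I chose for illustration.

```r
set.seed(2016)

rho <- 0.9  # assumed correlation of the bivariate normal target

# Log density of a standard bivariate normal with correlation rho,
# up to an additive constant (which cancels in the acceptance ratio).
log_target <- function(x) {
  -(x[1]^2 - 2 * rho * x[1] * x[2] + x[2]^2) / (2 * (1 - rho^2))
}

# Random-walk Metropolis: propose a Gaussian jump around the current point
# and accept it with probability min(1, target(proposal) / target(current)).
metropolis <- function(n_draws, start = c(0, 0), step = 0.5) {
  draws <- matrix(NA_real_, nrow = n_draws, ncol = 2)
  current <- start
  accepted <- 0L
  for (i in seq_len(n_draws)) {
    proposal <- current + rnorm(2, sd = step)
    if (log(runif(1)) < log_target(proposal) - log_target(current)) {
      current <- proposal
      accepted <- accepted + 1L
    }
    draws[i, ] <- current  # a rejected proposal repeats the current point
  }
  list(draws = draws, accept_rate = accepted / n_draws)
}

fit <- metropolis(5000)
print(fit$accept_rate)
print(cor(fit$draws[, 1], fit$draws[, 2]))
```

Plotting the first few hundred rows of fit$draws against the full set gives a feel for how slowly a random walk can spread along a narrow, correlated target, which is the behavior the figure contrasts with Hamiltonian Monte Carlo.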

For reasons I described last week, I believe that this year's useR tutorial speakers have raised the bar on both content and presentation. I am going to do my best to work through these before attending next year's conference in Brussels.
