Articles by Nina Zumel

Faceted Graphs with cdata and ggplot2

October 21, 2018 | Nina Zumel

In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris data, like so: I ...
[Read more...]

Announcing Practical Data Science with R, 2nd Edition

August 15, 2018 | Nina Zumel

We are pleased and excited to announce that we are working on a second edition of Practical Data Science with R! Manning Publications has just announced the launching of the MEAP (Manning Early Access Program) for the second edition. The MEAP allows you to subscribe to drafts of chapters as ...
[Read more...]

Partial Pooling for Lower Variance Variable Encoding

September 28, 2017 | Nina Zumel

Banaue rice terraces. Photo: Jon Rawlinson
In a previous article, we showed the use of partial pooling, or hierarchical/multilevel models, for level coding high-cardinality categorical variables in vtreat. In this article, we will discuss a little more about the how and why of partial pooling in R. We will ...
[Read more...]

Custom Level Coding in vtreat

September 25, 2017 | Nina Zumel

One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA "one-hot encoding"). Level coding can be computationally and ...
[Read more...]

Teaching pivot / un-pivot

April 11, 2017 | Nina Zumel

Authors: John Mount and Nina Zumel. Introduction: In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “...
[Read more...]

A Simple Example of Using replyr::gapply

December 19, 2016 | Nina Zumel

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of ...
[Read more...]

Using replyr::let to Parameterize dplyr Expressions

December 6, 2016 | Nina Zumel

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). ...
[Read more...]

Principal Components Regression, Pt. 2: Y-Aware Methods

May 23, 2016 | Nina Zumel

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what ...
[Read more...]

Principal Components Regression, Pt.1: The Standard Method

May 16, 2016 | Nina Zumel

In this note, we discuss principal components regression and some of the issues with it: the need for scaling, the need for pruning, and the lack of “y-awareness” of the standard dimensionality reduction step. The purpose of this article is to set the stage for presenting dimensionality reduction techniques appropriate for ...
[Read more...]

Finding the K in K-means by Parametric Bootstrap

February 10, 2016 | Nina Zumel

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available ...
[Read more...]

Using PostgreSQL in R: A quick how-to

February 1, 2016 | Nina Zumel

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead ...
[Read more...]

Upcoming Win-Vector Appearances

November 9, 2015 | Nina Zumel

We have two public appearances coming up in the next few weeks. Workshop at ODSC, San Francisco – November 14: Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect ...
[Read more...]

Our Differential Privacy Mini-series

November 1, 2015 | Nina Zumel

We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, ...
[Read more...]

A Simpler Explanation of Differential Privacy

October 2, 2015 | Nina Zumel

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news again now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning. In this ...
[Read more...]

How do you know if your model is going to work?

September 22, 2015 | Nina Zumel

Authors: John Mount and Nina Zumel. Our four-part article series collected into one piece. Part 1: The problem; Part 2: In-training set measures; Part 3: Out of sample procedures; Part 4: Cross-validation techniques. “Essentially, all models are wrong, but some are useful.” (George Box) Here’s a caricature of ...
[Read more...]

Bootstrap Evaluation of Clusters

September 4, 2015 | Nina Zumel

Illustration from Project Gutenberg
The goal of cluster analysis is to group the observations in the data into clusters such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of ...
[Read more...]

Working with Sessionized Data 2: Variable Selection

July 15, 2015 | Nina Zumel

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we ...
[Read more...]