Articles by Nina Zumel

Working with Sessionized Data 1: Evaluating Hazard Models

July 8, 2015 | Nina Zumel

When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a ... [Read more...]

Wanted: A Perfect Scatterplot (with Marginals)

June 11, 2015 | Nina Zumel

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki. The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Matlab, apparently. The seaborn version also shows the strength of the linear ... [Read more...]

Does Balancing Classes Improve Classifier Performance?

February 27, 2015 | Nina Zumel

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer ... [Read more...]

The Geometry of Classifiers

December 18, 2014 | Nina Zumel

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado et al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, ... [Read more...]

Estimating Generalization Error with the PRESS statistic

September 25, 2014 | Nina Zumel

As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however ... [Read more...]

Vtreat: designing a package for variable treatment

August 7, 2014 | Nina Zumel

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again: missing values (NA or blanks); problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1); valid categorical levels that don’t appear ... [Read more...]

Trimming the Fat from glm() Models in R

May 30, 2014 | Nina Zumel

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our ... [Read more...]

Bandit Formulations for A/B Tests: Some Intuition

April 24, 2014 | Nina Zumel

“Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.” – Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web” (2007). A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a ... [Read more...]

Practical Data Science with R: Release date announced

March 25, 2014 | Nina Zumel

It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version ... [Read more...]

The Statistics behind “Verification by Multiplicity”

March 1, 2014 | Nina Zumel

There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope. We normally don’t write about ... [Read more...]

The Extra Step: Graphs for Communication versus Exploration

January 12, 2014 | Nina Zumel

Visualization is a useful tool for data exploration and statistical analysis, and it’s an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren’t identical. One of the reasons that I like ggplot so much is that it excels at ... [Read more...]

Big News! Practical Data Science with R is content complete!

December 19, 2013 | Nina Zumel

The last appendix has gone to the editors; the book is now content complete. What a relief! We are hoping to release the book late in the first quarter of next year. In the meantime, you can still get early drafts of our chapters through Manning’s Early Access program, ... [Read more...]

Bayesian and Frequentist Approaches: Ask the Right Question

May 6, 2013 | Nina Zumel

It occurred to us recently that we don’t have any articles about Bayesian approaches to statistics here. I’m not going to get into the “Bayesian versus Frequentist” war; in my opinion, which style of approach to use is less about philosophy, and more about figuring out the best ... [Read more...]

Revisiting Cleveland’s The Elements of Graphing Data in ggplot2

February 18, 2013 | Nina Zumel

I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he ... [Read more...]

Error Handling in R

October 9, 2012 | Nina Zumel

It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a model, perhaps. I set the script running and turn to another task, only to ... [Read more...]

Modeling Trick: Impact Coding of Categorical Variables with Many Levels

July 23, 2012 | Nina Zumel

One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose ... [Read more...]

My Favorite Graphs

December 5, 2011 | Nina Zumel

“The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all.” – William Cleveland, The ... [Read more...]


R-bloggers

R news and tutorials contributed by hundreds of R bloggers
