Articles by Nina Zumel

Simpson’s Paradox in a Logistic Regression

February 6, 2025 | Nina Zumel

Simpson’s paradox is when a trend that is present in various groups of data seems to disappear or even reverse when those groups are combined. One sees examples of this often in things like medical trials, and the phenomenon is generally due to ...
[Read more...]
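The post itself works through the paradox in a logistic regression; as a quick illustration of the aggregation reversal the excerpt describes, here is a minimal Python sketch using the classic kidney-stone treatment data (Charig et al. 1986, a standard textbook example, not taken from the post):

```python
# Classic kidney-stone data, often used to illustrate Simpson's paradox:
# treatment A has the higher success rate within each group,
# yet the lower success rate after the groups are pooled.
data = {
    # group: {treatment: (successes, trials)}
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

for group, arms in data.items():
    ra, rb = rate(*arms["A"]), rate(*arms["B"])
    print(f"{group}: A={ra:.0%}, B={rb:.0%} -> A better: {ra > rb}")

# Pool the groups and the comparison reverses.
tot = {t: tuple(map(sum, zip(*(arms[t] for arms in data.values()))))
       for t in ("A", "B")}
ra, rb = rate(*tot["A"]), rate(*tot["B"])
print(f"pooled: A={ra:.0%}, B={rb:.0%} -> A better: {ra > rb}")
```

Here treatment A wins in both groups (93% vs 87%, 73% vs 69%) but loses overall (78% vs 83%), because the harder cases were disproportionately assigned to A.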

Dyson’s Algorithm: The General Case

January 23, 2025 | Nina Zumel

Photo by Marco Verch, CC-2.0. In a previous post, we looked at Dyson’s algorithm (Dyson 1946) for solving the Coins in Weighings problem: You have coins, to appearance exactly identical; but possibly one is counterfeit (we’l...
[Read more...]

The Perplexed Banker

December 19, 2024 | Nina Zumel

From Henry Dudeney’s “Perplexities” article in the March 1925 issue of The Strand Magazine: A man went into a bank with 1,000 silver dollars and 10 bags. He said, “Place this money, please, in the bags in such a way that if I call and ask for a...
[Read more...]

Bachet’s Four Weights Problem

December 5, 2024 | Nina Zumel

Here’s another puzzle from Henry Dudeney’s article “The World’s Best Puzzles,” The Strand Magazine, December 1908. According to Dudeney, this puzzle is originally from Problèmes plaisans et délectables qui se font par les nombres (Pleasant and d...
[Read more...]

100 Bushels of Corn

November 14, 2024 | Nina Zumel

About the author Nina Zumel is a data scientist based in San Francisco, with 20+ years of experience in machine learning, statistics, and analytics. She is the co-founder of the data science consulting firm Win-Vector LLC, and (with Joh...
[Read more...]

Post-hoc Adjustment for Zero-Thresholded Linear Models

August 16, 2024 | Nina Zumel

Suppose you are modeling a process that you believe is well approximated as being linear in its inputs, but only within a certain range. Outside that range, the output might saturate or threshold: for example if you are modeling a count or a physical process, you likely can never get […]
[Read more...]
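The excerpt cuts off before the post's actual adjustment method, so the following is only a sketch of the setting it describes: a latent linear process whose observed output is thresholded at zero, fit naively with least squares and clipped. All names and simulated values here are hypothetical, and the post uses R rather than Python.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the underlying process is linear in x, but the
# observed output is thresholded at zero (e.g. counts can't go negative).
n = 1000
x = rng.uniform(-2, 2, size=n)
y_latent = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y = np.maximum(y_latent, 0.0)

# Naive approach: fit ordinary least squares to the thresholded data,
# then clip the predictions at zero. The clipping is the
# "zero-threshold" step; the fit itself is biased near the threshold,
# which is the problem a post-hoc adjustment would try to correct.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = np.clip(X @ beta, 0.0, None)
```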

Exploring the XI Correlation Coefficient

December 29, 2021 | Nina Zumel

Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) […]
[Read more...]
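For a flavor of the coefficient the excerpt introduces, here is a minimal Python sketch of Chatterjee's no-ties formula (the post itself explores \(\xi\) in R): sort the pairs by \(x\), take the ranks \(r_i\) of the corresponding \(y\) values, and compute \(\xi = 1 - 3 \sum_i |r_{i+1} - r_i| / (n^2 - 1)\).

```python
import numpy as np

def xi_corr(x, y):
    """Chatterjee's xi correlation, no-ties formula."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    order = np.argsort(x)                          # sort pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y, in x-order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

# Demo: y = x^2 is a deterministic function of x, so xi is close to 1,
# while the Pearson correlation is close to 0.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=2000)
print(xi_corr(x, x**2))            # near 1
print(np.corrcoef(x, x**2)[0, 1])  # near 0
```

Note the asymmetry the property in the excerpt implies: \(\xi(x, y)\) measures whether \(y\) is a function of \(x\), not the reverse.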

When Cross-Validation is More Powerful than Regularization

November 12, 2019 | Nina Zumel

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that ...
[Read more...]
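To make the collinearity setting concrete, here is a minimal Python sketch of ridge regression restricting coefficient magnitudes on near-duplicate inputs (the post itself works in R; the simulated data here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two nearly collinear inputs; only their sum actually drives y.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge fit: minimizes ||y - Xb||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)   # unregularized: ill-determined, can blow up
beta_l2 = ridge(X, y, 1.0)    # ridge: coefficient magnitudes restricted
print(beta_ols, beta_l2)
```

Because the ridge solution minimizes the penalized objective, its L2 norm can never exceed that of the unregularized fit; the shrinkage falls mostly on the ill-determined difference between the two collinear coefficients.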

Why Do We Plot Predictions on the x-axis?

September 27, 2019 | Nina Zumel

When studying regression models, one of the first diagnostic plots most students learn is the plot of residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example.

# build an "ideal" linear process.
set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
noise = 0.25*...
[Read more...]

WVPlots 1.1.2 on CRAN

September 12, 2019 | Nina Zumel

I have put a new release of the WVPlots package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package. WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to ...
[Read more...]

An Ad-hoc Method for Calibrating Uncalibrated Models

July 16, 2019 | Nina Zumel

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized ...
[Read more...]

Common Ensemble Models can be Biased

July 11, 2019 | Nina Zumel

In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may be ...
[Read more...]
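A minimal Python sketch of "calibrated" in the rollup sense the excerpt uses (the posts themselves work in R): for a least-squares fit with an intercept, the residuals are orthogonal to the columns of the design matrix, so the mean prediction reproduces the mean outcome exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical training data.
n = 300
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# OLS with an intercept column: residuals sum to zero, so the rollup
# (mean) of the predictions matches the rollup of the outcome.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
print(pred.mean(), y.mean())   # identical up to floating point
```

Per-group rollups are preserved in the same way only for groupings the model can represent (e.g. group indicator columns in the design matrix); ensemble models carry no such guarantee, which is the bias the post discusses.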

Link Functions versus Data Transforms

July 7, 2019 | Nina Zumel

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason ...
[Read more...]
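One way to see the distinction in the title: transforming the outcome (regressing on log10(income)) targets the conditional *geometric* mean, while a log *link* models the log of the conditional *arithmetic* mean. For skewed data the two differ. A minimal Python sketch with hypothetical lognormal "income" data (the book's example is in R):

```python
import numpy as np

rng = np.random.default_rng(4)

# Lognormal "income": normally distributed on the log10 scale.
income = 10 ** rng.normal(loc=4.5, scale=0.35, size=10_000)

geo_mean = 10 ** np.log10(income).mean()  # what a log-transformed model targets
arith_mean = income.mean()                # what a log-link model targets
print(geo_mean, arith_mean)               # geometric mean is smaller
```

Naively back-transforming predictions from a model fit on log10(income) therefore systematically underestimates mean income.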

Cohen’s D for Experimental Planning

June 18, 2019 | Nina Zumel

In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments. Estimating sample size Let’s imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program ...
[Read more...]
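The quantities the excerpt introduces can be sketched briefly in Python (the post works in R): Cohen's D is the difference of means in units of the pooled standard deviation, and Lehr's rule of thumb converts it to a sample size, roughly n = 16 / d² per group for a two-sided test at the 5% level with 80% power.

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d: difference of means over the pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

def n_per_group(d):
    """Lehr's rule of thumb: subjects per group for 80% power at alpha=0.05."""
    return math.ceil(16 / d**2)

# A "medium" effect (d = 0.5) needs about 64 subjects per group;
# a "small" effect (d = 0.2) needs about 400.
print(n_per_group(0.5), n_per_group(0.2))
```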

PDSwR2: New Chapters!

February 6, 2019 | Nina Zumel

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, ...
[Read more...]

More on sigr

November 6, 2018 | Nina Zumel

If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form: model_lm [Read more...]