Articles by Nina Zumel

Simpson’s Paradox in a Logistic Regression

February 6, 2025 | Nina Zumel

Simpson’s paradox is when a trend that is present in various groups of data seems to disappear or even reverse when those groups are combined. One sees examples of this often in things like medical trials, and the phenomenon is generally due to ...
[Read more...]
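The post itself works through the paradox in a logistic regression; as a quick illustration of the aggregation reversal the excerpt describes, here is a minimal Python sketch using the classic kidney-stone treatment data (Charig et al. 1986, a standard textbook example, not taken from the post):

```python
# Classic kidney-stone data, often used to illustrate Simpson's paradox:
# treatment A has the higher success rate within each group,
# yet the lower success rate after the groups are pooled.
data = {
    # group: {treatment: (successes, trials)}
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

for group, arms in data.items():
    ra, rb = rate(*arms["A"]), rate(*arms["B"])
    print(f"{group}: A={ra:.0%}, B={rb:.0%} -> A better: {ra > rb}")

# Pool the groups and the comparison reverses.
tot = {t: tuple(map(sum, zip(*(arms[t] for arms in data.values()))))
       for t in ("A", "B")}
ra, rb = rate(*tot["A"]), rate(*tot["B"])
print(f"pooled: A={ra:.0%}, B={rb:.0%} -> A better: {ra > rb}")
```

Here treatment A wins in both groups (93% vs 87%, 73% vs 69%) but loses overall (78% vs 83%), because the harder cases were disproportionately assigned to A.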

Dyson’s Algorithm: The General Case

January 23, 2025 | Nina Zumel

Photo by Marco Verch, CC-2.0. In a previous post, we looked at Dyson’s algorithm (Dyson 1946) for solving the Coins in Weighings problem: You have coins, to appearance exactly identical; but possibly one is counterfeit (we’l...
[Read more...]

The Perplexed Banker

December 19, 2024 | Nina Zumel

From Henry Dudeney’s “Perplexities” article in the March 1925 issue of The Strand Magazine: A man went into a bank with 1,000 silver dollars and 10 bags. He said, “Place this money, please, in the bags in such a way that if I call and ask for a...
[Read more...]

Bachet’s Four Weights Problem

December 5, 2024 | Nina Zumel

Here’s another puzzle from Henry Dudeney’s article “The World’s Best Puzzles,” The Strand Magazine, December 1908. According to Dudeney, this puzzle is originally from Problèmes plaisans et délectables qui se font par les nombres (Pleasant and d...
[Read more...]

100 Bushels of Corn

November 14, 2024 | Nina Zumel

About the author Nina Zumel is a data scientist based in San Francisco, with 20+ years of experience in machine learning, statistics, and analytics. She is the co-founder of the data science consulting firm Win-Vector LLC, and (with Joh...
[Read more...]

Post-hoc Adjustment for Zero-Thresholded Linear Models

August 16, 2024 | Nina Zumel

Suppose you are modeling a process that you believe is well approximated as being linear in its inputs, but only within a certain range. Outside that range, the output might saturate or threshold: for example if you are modeling a count or a physical process, you likely can never get […]
[Read more...]
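The excerpt cuts off before the post's actual adjustment method, so the following is only a sketch of the setting it describes: a latent linear process whose observed output is thresholded at zero, fit naively with least squares and clipped. All names and simulated values here are hypothetical, and the post uses R rather than Python.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the underlying process is linear in x, but the
# observed output is thresholded at zero (e.g. counts can't go negative).
n = 1000
x = rng.uniform(-2, 2, size=n)
y_latent = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y = np.maximum(y_latent, 0.0)

# Naive approach: fit ordinary least squares to the thresholded data,
# then clip the predictions at zero. The clipping is the
# "zero-threshold" step; the fit itself is biased near the threshold,
# which is the problem a post-hoc adjustment would try to correct.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = np.clip(X @ beta, 0.0, None)
```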

Exploring the XI Correlation Coefficient

December 29, 2021 | Nina Zumel

Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) […]
[Read more...]
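For a flavor of the coefficient the excerpt introduces, here is a minimal Python sketch of Chatterjee's no-ties formula (the post itself explores \(\xi\) in R): sort the pairs by \(x\), take the ranks \(r_i\) of the corresponding \(y\) values, and compute \(\xi = 1 - 3 \sum_i |r_{i+1} - r_i| / (n^2 - 1)\).

```python
import numpy as np

def xi_corr(x, y):
    """Chatterjee's xi correlation, no-ties formula."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    order = np.argsort(x)                          # sort pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y, in x-order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

# Demo: y = x^2 is a deterministic function of x, so xi is close to 1,
# while the Pearson correlation is close to 0.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=2000)
print(xi_corr(x, x**2))            # near 1
print(np.corrcoef(x, x**2)[0, 1])  # near 0
```

Note the asymmetry the property in the excerpt implies: \(\xi(x, y)\) measures whether \(y\) is a function of \(x\), not the reverse.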

When Cross-Validation is More Powerful than Regularization

November 12, 2019 | Nina Zumel

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that ...
[Read more...]
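To make the collinearity setting concrete, here is a minimal Python sketch of ridge regression restricting coefficient magnitudes on near-duplicate inputs (the post itself works in R; the simulated data here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two nearly collinear inputs; only their sum actually drives y.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge fit: minimizes ||y - Xb||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)   # unregularized: ill-determined, can blow up
beta_l2 = ridge(X, y, 1.0)    # ridge: coefficient magnitudes restricted
print(beta_ols, beta_l2)
```

Because the ridge solution minimizes the penalized objective, its L2 norm can never exceed that of the unregularized fit; the shrinkage falls mostly on the ill-determined difference between the two collinear coefficients.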

Why Do We Plot Predictions on the x-axis?

September 27, 2019 | Nina Zumel

When studying regression models, one of the first diagnostic plots most students learn is the plot of residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example.

# build an "ideal" linear process.
set.seed(34524)
N = 100
x1 = runif(N)
x2 = runif(N)
noise = 0.25*...
[Read more...]

WVPlots 1.1.2 on CRAN

September 12, 2019 | Nina Zumel

I have put a new release of the WVPlots package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package. WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to ...
[Read more...]

An Ad-hoc Method for Calibrating Uncalibrated Models

July 16, 2019 | Nina Zumel

In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized ...
[Read more...]

Common Ensemble Models can be Biased

July 11, 2019 | Nina Zumel

In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may be ...
[Read more...]
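A minimal Python sketch of "calibrated" in the rollup sense the excerpt uses (the posts themselves work in R): for a least-squares fit with an intercept, the residuals are orthogonal to the columns of the design matrix, so the mean prediction reproduces the mean outcome exactly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical training data.
n = 300
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# OLS with an intercept column: residuals sum to zero, so the rollup
# (mean) of the predictions matches the rollup of the outcome.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
print(pred.mean(), y.mean())   # identical up to floating point
```

Per-group rollups are preserved in the same way only for groupings the model can represent (e.g. group indicator columns in the design matrix); ensemble models carry no such guarantee, which is the bias the post discusses.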

Link Functions versus Data Transforms

July 7, 2019 | Nina Zumel

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason ...
[Read more...]
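One way to see the distinction in the title: transforming the outcome (regressing on log10(income)) targets the conditional *geometric* mean, while a log *link* models the log of the conditional *arithmetic* mean. For skewed data the two differ. A minimal Python sketch with hypothetical lognormal "income" data (the book's example is in R):

```python
import numpy as np

rng = np.random.default_rng(4)

# Lognormal "income": normally distributed on the log10 scale.
income = 10 ** rng.normal(loc=4.5, scale=0.35, size=10_000)

geo_mean = 10 ** np.log10(income).mean()  # what a log-transformed model targets
arith_mean = income.mean()                # what a log-link model targets
print(geo_mean, arith_mean)               # geometric mean is smaller
```

Naively back-transforming predictions from a model fit on log10(income) therefore systematically underestimates mean income.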

Cohen’s D for Experimental Planning

June 18, 2019 | Nina Zumel

In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments. Estimating sample size Let’s imagine you are testing a new weight loss program and comparing it to some existing weight loss regimen. You want to run an experiment to determine if the new program ...
[Read more...]
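The quantities the excerpt introduces can be sketched briefly in Python (the post works in R): Cohen's D is the difference of means in units of the pooled standard deviation, and Lehr's rule of thumb converts it to a sample size, roughly n = 16 / d² per group for a two-sided test at the 5% level with 80% power.

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d: difference of means over the pooled standard deviation."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

def n_per_group(d):
    """Lehr's rule of thumb: subjects per group for 80% power at alpha=0.05."""
    return math.ceil(16 / d**2)

# A "medium" effect (d = 0.5) needs about 64 subjects per group;
# a "small" effect (d = 0.2) needs about 400.
print(n_per_group(0.5), n_per_group(0.2))
```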

PDSwR2: New Chapters!

February 6, 2019 | Nina Zumel

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, ...
[Read more...]

More on sigr

November 6, 2018 | Nina Zumel

If you’ve read our previous R Tip on using sigr with linear models, you might have noticed that the lm() summary object does in fact carry the R-squared and F statistics, both in the printed form: model_lm [Read more...]