
Regressions 101: “Significance”

[This article was first published on isomorphismes, and kindly contributed to R-bloggers].
SETUP (CAN BE SKIPPED)

We start with data (how was it collected?) and the hope that we can compare them. We also start with a question of the form “how much does the interesting thing respond when some other variable changes?” (the post's example questions were in an image that is not reproduced here), bearing in mind that this response-magnitude may differ under varying circumstances. (Raising morning-beauty-prep time from 1 minute to 10 minutes will do more than raising 110 minutes to 120 minutes of prep. Also there may be interaction terms: for instance, you need both a petroleum engineering degree and to live in one of {Naija, Indonesia, Alaska, Kazakhstan, Saudi Arabia, Oman, Qatar} in order to see the income bump. Also many of these questions have a time-factor, like the MBA and the climate ones.)

As Trygve Haavelmo put it: using reason alone we can probably figure out which direction each of these responses will go. But knowing just that raising the tax rate will drive away some number of rich doesn’t push the debate very far—if all you lose is a handful of symbolic Eduardo Saverins who were already on the cusp of fleeing the country, then bringing up the Laffer curve is chaff. But if the number turns out to be large then it’s really worth discussing.

In less polite terms: until we quantify what we’re debating about, you can spit bollocks all day long. Once the debate is quantified, the discussion should become way more intelligent and less prone to derailing into irrelevant theoretically-possible-issues-which-are-not-really-worth-wasting-time-on.

So we change one variable over which we have control and measure how the interesting thing responds. Once we measure both we come to the regression stage where we try to make a statement of the form “A 30% increase in effort will result in a 10% increase in wage” or “5 extra minutes getting ready in the morning will make me look 5% better”. (You should agree from those examples that the same number won’t necessarily hold throughout the whole range. Like if I spend three hours getting ready the returns will have diminished from the returns on the first five minutes.)

Avoiding causal language, we say that a 10% increase in (your salary) is associated with a 30% increase in (your effort).
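One way to read the “30% more effort, 10% more wage” statement is as a slope on log scales (an elasticity). The sketch below is in Python rather than R so it runs with no packages, and the constant-elasticity form is an assumption made purely for illustration; the post does not commit to it.

```python
import math

# Assumption for illustration: wage responds to effort with a constant
# elasticity, i.e. log(wage) = a + beta * log(effort).
effort_increase = 0.30   # "a 30% increase in effort"
wage_increase = 0.10     # "a 10% increase in wage"

# Slope on the log-log scale implied by those two percentages.
beta = math.log(1 + wage_increase) / math.log(1 + effort_increase)

def wage_multiplier(effort_multiplier, beta):
    """Wage ratio implied by an effort ratio under constant elasticity."""
    return effort_multiplier ** beta

print(f"implied elasticity beta = {beta:.3f}")
print(f"effort x1.3 -> wage x{wage_multiplier(1.3, beta):.3f}")
print(f"effort x2.0 -> wage x{wage_multiplier(2.0, beta):.3f}")
```

Under this assumed form the elasticity is the same at every effort level, which is exactly the simplification that the diminishing-returns caveat warns about.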

 
MAIN PART (SKIP TO HERE IF SKIMMING)

The two numbers that jump out of any regression table output (e.g., lm in R) are p and β.

Regression tables spit out many, many numbers (like the Durbin-Watson statistic, the F statistic, the Akaike information criterion, and more) specifically to measure potential problems with interpreting β and p naïvely. With that caveat in mind, here are pictures of the textbook situations where p and β can be interpreted in the straightforward way:

First, the standard cases where the regression analysis works as it should and how to read it is fairly obvious:
(NB: These are continuous variables rather than on/off switches or ordered categories. So instead of “Followed the weight-loss regimen” or “Didn’t follow the weight-loss regimen”, someone has quantified how closely it was followed. Again, the actual measurements (how they were coded) get in the way of our gleeful playing with numbers.)



Second, the case I want to draw attention to: low statistical significance doesn’t necessarily mean nothing’s going on there.


The code I used to generate these fake-data and plots.

If the regression measures a high β but low confidence (high p), that is still worth taking a look at. If regression picks up a wide disparity in male-versus-female wages (let’s say double) but we’re not so confident (high p) that it’s exactly double because it’s sometimes 95%, sometimes 180%, sometimes 310%, we’ve still picked up a significant effect.

The exact value of β would not be statistically significant, or confidently precise, because of the high p, but in the everyday sense this would be a very significant finding. (Try the same with any of my other examples, or another quantitative-comparison scenario you think up. It’s either a serious opportunity or a serious problem that you’ve uncovered. It just needs a further look to see where the variation around double comes from.)
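A minimal sketch of the “large β, high p” situation, in Python with only the standard library. This is not the post's own fake-data code: the data are hand-built (the noise pattern is chosen to be orthogonal to x, so the fitted slope is 2 at any noise scale), the names `simple_ols` and `fake_y` are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable at n = 50.

```python
import math
from statistics import NormalDist

def simple_ols(x, y):
    """Slope, its standard error, and an approximate two-sided p-value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ssr = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / sxx)
    t = beta / se
    # Normal approximation to the t distribution (assumption, see lead-in).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return beta, se, p

# Hand-built fake data: the pattern sums to zero and is orthogonal to x,
# so scaling it changes the scatter but never the fitted slope.
x = [-2, -1, 0, 1, 2] * 10
pattern = [1, -2, 0, 2, -1] * 10

def fake_y(noise_scale):
    return [2 * xi + noise_scale * ei for xi, ei in zip(x, pattern)]

tight = simple_ols(x, fake_y(1))
noisy = simple_ols(x, fake_y(12))
for label, (beta, se, p) in [("tight", tight), ("noisy", noisy)]:
    print(f"{label}: beta = {beta:.2f}, se = {se:.2f}, p = {p:.3f}")
```

Both fits report the same slope; only the scatter around it, and therefore p, differs. Whether the noisy case reflects a real effect cannot be settled from these numbers alone, which is why it calls for a closer look rather than dismissal.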

You can read elsewhere about how awful it is that p<.05 is the password for publishable science, for many reasons that require some statistical vocabulary. But I think the most intuitive problem is the one I just stated. If your Geiger counter flips out to ten times the deadly level of radiation, it doesn’t matter if it sometimes reads 8, sometimes 5, and sometimes 15: the point is, you need to be worried and get the h*** out of there. (Unless the machine is wacked, but you’d still be spooked, wouldn’t you?)

 
FOLLOW-UP (CAN BE SKIPPED)

The scale of β is the all-important thing that we are after. Small differences in βs of variables that are important to your life can make a huge difference.





Order-of-magnitude differences (like 20 versus 2) are the difference between fly and dog; between life in the USA and near-famine; between oil tanker and gas pump; between Tibet’s altitude and Illinois’; between driving and walking; even the Black Death was only a tenth of an order of magnitude of reduction in human population.
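The arithmetic behind “orders of magnitude” can be sketched in a few lines of Python. The helper name `orders_of_magnitude` is hypothetical, and the second calculation simply takes the “tenth of an order of magnitude” figure at face value without checking it against historical estimates.

```python
import math

def orders_of_magnitude(a, b):
    """Base-10 orders of magnitude separating a from b."""
    return math.log10(a / b)

# 20 versus 2 is a factor of ten, i.e. one order of magnitude.
print(orders_of_magnitude(20, 2))

# A reduction of a tenth of an order of magnitude, as a surviving fraction.
surviving_fraction = 10 ** -0.1
print(f"surviving fraction: {surviving_fraction:.3f}")
print(f"implied loss: {1 - surviving_fraction:.1%}")
```

On that reading, a tenth of an order of magnitude is a loss of roughly a fifth, which gives a sense of how large a full factor-of-ten difference in β is.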




Keeping in mind that calculus tells us that nonlinear functions can be approximated in a local region by linear functions (unless the nonlinear function jumps), β is an acceptable measure of how the interesting thing responds “around the current levels of webspeed” or “around the current levels of taxation”.



Linear response magnitudes can also be used to estimate global responses in a nonlinear function, but you will be quantifying something other than the local linear approximation.
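The difference between a local β and a global one can be sketched as follows, in Python. The curve `response` is a hypothetical diminishing-returns function chosen for illustration (it is not from the post), and `local_beta` and `global_beta` are likewise made-up names: the first is the slope of the tangent near the current level, the second the average slope over a whole range.

```python
import math

def response(x):
    """Hypothetical diminishing-returns curve; not taken from the post."""
    return math.log(1 + x)

def local_beta(f, x0, h=1e-5):
    """Slope of f near x0, estimated by a central difference."""
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

def global_beta(f, lo, hi):
    """Average slope of f over [lo, hi] (the secant line)."""
    return (f(hi) - f(lo)) / (hi - lo)

x0 = 10  # "the current level"
beta_here = local_beta(response, x0)

def linear_guess(x):
    """Prediction from the local linear approximation around x0."""
    return response(x0) + beta_here * (x - x0)

near_error = abs(linear_guess(11) - response(11))
far_error = abs(linear_guess(120) - response(120))
beta_range = global_beta(response, 1, 120)

print(f"local beta at x={x0}: {beta_here:.4f}")
print(f"average beta over 1..120: {beta_range:.4f}")
print(f"error of local line at x=11: {near_error:.4f}")
print(f"error of local line at x=120: {far_error:.4f}")
```

The local line does well one step away from the current level and badly far from it, while the average slope over the whole range is a different number answering a different question.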

