Call them what you will
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve been playing around with the R package texreg for creating combined regression tables for multiple models. It’s not the only package to do that – see here for a review – but it’s often handy to be able to generate both ascii art, latex, and html versions of the same table using almost identical syntax. Also, the ascii art creating screenreg function allows me to bypass the pdf construction cycle I previously described here. The coefficient plots from plotreg are pretty cool too.
This post is about making the variables listed in those combined regression tables more readable. That is particularly important when data comes from variable-mangling statistical software or from co-authors whose idea of a descriptive name could pass for an online banking password. Even R will cheerfully mash up your carefully chosen variable names through formulas, factors, and interactions. So for work people are going to see, variables should have sensible names.
First I’ll walk through the existing texreg machinery for renaming, omitting, and reordering variables, and then propose a hopefully more intuitive implementation. I’ll demonstrate all this using screenreg on a classic data set on job prestige.
Job prestige
Consider the ‘Prestige’ data in the car package. This dataset gives information about various occupations collected from census data in Canada in the early 1970s. For each occupation there’s a level of job prestige and some covariates that potentially explain that level. The covariates are the average income of that occupation in dollars, the average years of education in years, the percentage of women it contained, a census occupational code that we can ignore, and a nominal variable called type that takes the values blue collar (‘bc’), white collar (‘wc’), and professional (‘prof’). Let’s tidy the data and construct a couple of models to show texreg at work.
require(car) require(texreg) data(Prestige) # make professional the base category and rebase income prest <- transform(Prestige, type=relevel(type, 'prof'), income=income/1000) # Income has a single relationship to job prestige mod1 <- lm(prestige ~ education + type + income + women, data=prest) # Income has different relationship to job prestige depending on job type mod2 <- lm(prestige ~ education + type * income + women, data=prest)
Now let's compare these two models
screenreg(list(mod1, mod2)) ==================================== Model 1 Model 2 ------------------------------------ (Intercept) 5.09 17.47 * (8.87) (8.19) education 3.66 *** 2.80 *** (0.65) (0.59) typebc -5.91 -27.55 *** (3.94) (5.41) typewc -8.82 ** -24.12 *** (2.79) (5.35) income 1.04 *** 0.85 *** (0.26) (0.23) women 0.01 0.08 * (0.03) (0.03) typebc:income 3.01 *** (0.58) typewc:income 1.91 * (0.81) ------------------------------------ R^2 0.83 0.87 Adj. R^2 0.83 0.86 Num. obs. 98 98 ==================================== *** p < 0.001, ** p < 0.01, * p < 0.05
So far so good. But these variable names are automatically constructed and inevitably a bit weird-looking. I mean, 'typebc'?
Making better variable names
We can rename variables before constructing the table using the custom.coef.names argument.
screenreg(list(mod1, mod2), custom.coef.names=c('intercept', 'education', 'type: blue collar', 'type: white collar', 'income', 'prop. female', 'income x blue collar', 'income x white collar')) ============================================ Model 1 Model 2 -------------------------------------------- intercept 5.09 17.47 * (8.87) (8.19) education 3.66 *** 2.80 *** (0.65) (0.59) type: blue collar -5.91 -27.55 *** (3.94) (5.41) type: white collar -8.82 ** -24.12 *** (2.79) (5.35) income 1.04 *** 0.85 *** (0.26) (0.23) prop. female 0.01 0.08 * (0.03) (0.03) income x blue collar 3.01 *** (0.58) income x white collar 1.91 * (0.81) -------------------------------------------- R^2 0.83 0.87 Adj. R^2 0.83 0.86 Num. obs. 98 98 ============================================ *** p < 0.001, ** p < 0.01, * p < 0.05
That's more readable already.
Omitting things
Actually maybe we don't care about the intercept because not all of these variables are meaningful at zero, so let's omit that.
screenreg(list(mod1, mod2), custom.coef.names=c('intercept', 'education', 'type: blue collar', 'type: white collar', 'income', 'prop. female', 'income x blue collar', 'income x white collar'), omit.coef='intercept') ============================================ Model 1 Model 2 -------------------------------------------- education 3.66 *** 2.80 *** (0.65) (0.59) type: blue collar -5.91 -27.55 *** (3.94) (5.41) type: white collar -8.82 ** -24.12 *** (2.79) (5.35) income 1.04 *** 0.85 *** (0.26) (0.23) prop. female 0.01 0.08 * (0.03) (0.03) income x blue collar 3.01 *** (0.58) income x white collar 1.91 * (0.81) -------------------------------------------- R^2 0.83 0.87 Adj. R^2 0.83 0.86 Num. obs. 98 98 ============================================ *** p < 0.001, ** p < 0.01, * p < 0.05
this doesn't seem such a big deal, but in other analyses when you've got, for example, 120 country fixed effects in a variable called country that you don't want to see then it's nice to be able to say
omit.coef='country'
This gets treated like a regular expression and omits all the 'countryAngola' type things. If you've also got, say, regional fixed effects and you want to skip them too then it would be
omit.coef='country|region'
We'll use this little trick later.
Reordering things
This prettification is all very nice, but arguably the coefficients appear in the wrong order. Specifically, we should probably see the interaction between type and income directly after the income and type 'main effects' without the intervening women variable. To do this we can use the reorder.coef argument
screenreg(list(mod1, mod2), custom.coef.names=c('intercept', 'education', 'type: blue collar', 'type: white collar', 'income', 'prop. female', 'income x blue collar', 'income x white collar'), omit.coef='intercept', reorder.coef=c(1, 4, 2, 3, 6, 7, 5)) ============================================ Model 1 Model 2 -------------------------------------------- education 3.66 *** 2.80 *** (0.65) (0.59) income 1.04 *** 0.85 *** (0.26) (0.23) type: blue collar -5.91 -27.55 *** (3.94) (5.41) type: white collar -8.82 ** -24.12 *** (2.79) (5.35) income x blue collar 3.01 *** (0.58) income x white collar 1.91 * (0.81) prop. female 0.01 0.08 * (0.03) (0.03) -------------------------------------------- R^2 0.83 0.87 Adj. R^2 0.83 0.86 Num. obs. 98 98 ============================================ *** p < 0.001, ** p < 0.01, * p < 0.05
Here we explicitly specified a reordering of the indexes of the original variable names.
This works, but it takes a little bit of trial and error to get all the indexing right. Basically, it's brittle. What happens if we decide to drop the female variable? Or move it to the top of the variable list? It's right in the middle of the reordering sequence and the list of new variable names so we'd have to construct these arguments again, more or less from scratch.
Let's think about this again and instead of exploring the existing functionality let's ask for what we want.
Asking for what we want
First, we'd like to be able to rename variables. That is, really rename them, not just assign a string to whatever variable happens to be number 2 in a list of other strings. Also, we'd usually like to be implicit about omission. A reasonable convention might be that if we don't bother to mention a variable then we don't ever want to see it in the table. And we'd like to be implicit about ordering. Normally we can provide a total ordering of variables, e.g. we can assert that if type and income are included we want income to appear first, and if they are interacted we want to see the interaction terms directly after type.
A straightforward way to express all these things would be to define a list that describes what new name each possible variable name should be mapped to, what sequence they should appear in, and which variables should be omitted in the final table. That is to say, something like
name.map <- list(education="education", income="income", typewc="type: white collar", typebc="type: blue collar", "typewc:income"="income x white collar", "typebc:income"="income x blue collar", women='prop. female')
This is just an ordinary R list, although note that R's default colon-containing interaction term names need to be quoted to stop R thinking they are sequences or some other foolishness. We could get around that with a little fancy programming, but let's not bother.
The order of the list is a total ordering over things we'd like to see in a regression table. Regression tables should show the subset of names in this list, in the order they appear there. Variable names that are not mapped are ones we never wish to see; there is no intercept mentioned because we always want to omit that.
Building something to get what we want
So much for asking for what we want. Now to construct a function that turns this little specification into something that screenreg can deal with.
The following function has two arguments: 'final.rnames': an array containing the variable names that would appear in screenreg output if we didn't change anything, and 'name.map' which we discussed above. It returns a list with three elements, 'ccn': a suitable value for the custom.coef.names argument, 'oc': a suitable value for the omit.coef argument, and 'rc': a suitable value for the reorder.coef argument.
build.ror <- function(final.rnames, name.map){ keep <- final.rnames %in% names(name.map) mapper <- function(x){ mp <- name.map[[x]] ifelse(is.null(mp), x, mp) } newnames <- sapply(final.rnames, mapper) omit <- paste0(final.rnames[!keep], collapse="|") reorder <- na.omit(match(unlist(name.map), newnames[keep])) list(ccn=newnames, oc=omit, rc=reorder) }
Aside
What is 'final.rnames' here? It's
c("(Intercept)", "education", "typebc", "typewc", "income", "women", "typebc:income", "typewc:income")
But where did it come from?
Well, we could have just typed in these names. But really, life is short, and it's more fun to use a probably-ill-advised and certainly contrary-to-all-the-etiquette-of-R function like the one below to figure out the variable names
all.varnames.dammit <- function(model.list){ mods <- texreg:::get.data(model.list) gofers <- texreg:::get.gof(mods) mm <- texreg:::aggregate.matrix(mods, gofers, digits=3) rownames(mm) }
This takes the same list of models as you'd give screenreg and tells you what the final variable list would be. Handy, eh? Maybe there is a kosher method of getting hold of this info, but I don't know what it is.
As it happens, this function just does what texreg does internally. All those triple colons are there because the author didn't really want us messing around in the unexported functions the way he does. Shenanigan-free exposure of this variable name information would, of course, be preferable. I'll ask... In the meantime, use at your own risk.
Getting what we want
Now we've now got our original variable name list and a mapping function let's put build.ror to use.
# bundle up some models model.list <- list(mod1, mod2) # get the names of the coefficients in all the models oldnames <- all.varnames.dammit(model.list) # construct three suitable arguments for screenreg ror <- build.ror(oldnames, name.map) # add them as arguments to a screenreg call screenreg(model.list, custom.coef.names=ror$ccn, omit.coef=ror$oc, reorder.coef=ror$rc) ============================================ Model 1 Model 2 -------------------------------------------- education 3.66 *** 2.80 *** (0.65) (0.59) income 1.04 *** 0.85 *** (0.26) (0.23) type: white collar -8.82 ** -24.12 *** (2.79) (5.35) type: blue collar -5.91 -27.55 *** (3.94) (5.41) income x white collar 1.91 * (0.81) income x blue collar 3.01 *** (0.58) prop. female 0.01 0.08 * (0.03) (0.03) -------------------------------------------- R^2 0.83 0.87 Adj. R^2 0.83 0.86 Num. obs. 98 98 ============================================ *** p < 0.001, ** p < 0.01, * p < 0.05
and there we are.
To put the proportion of women variable first, just move it to the front of the name.map and run the whole thing again. No more index calculations in your head. To omit a variable, just take it out of the list.
It doesn't matter if there are names in name.map that never appear as variables in any of your models. You should feel free to put a project's worth of variable name mappings in there, just in case.
Finally, all this works with texreg and htmlreg too (but not for plotreg because that doesn't combine models). If you use it to create latex, do make sure your new variable names are properly escaped before they meet the latex compiler. Percentage and dollars signs are the usual culprits.
Extensions
It would be natural to build something like these functions into texreg itself. Or at the very least, bundle the two functions above together so that we could just hand in a list of models and a variable name mapping. Maybe I'll save that for another post.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.