How to stop worrying and start using R packages efficiently for Econometrics
Dedicated to all those who write stupid comments about people with disabilities, LGBT+ people, non-white people and women, and especially to those who promote false inclusion. As always, this article is a personal opinion and does not involve the people who work with me unless they explicitly agree with these words.
Context
When you do econometrics, using R is not immediately straightforward, and for most users it takes time to learn to use it the way they would use Stata or SPSS. R has a steep learning curve, but we can flatten it with a few additions to the vanilla R setup.
If you are just trying R and you come from Stata, you might already have been shocked by at least one of these (it happened to me back in 2015):
- It is not easy to compute clustered standard errors for linear models
- Obtaining a pseudo-R2 for generalized linear models needs additional work (see the sketch after this list)
- There are different packages for similar usages
- Not all packages are compatible with each other
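As a taste of the extra work involved, here is a minimal sketch of McFadden's pseudo-R2 for a GLM, computed by hand from the log-likelihoods. The data frame `my_data` and its columns are hypothetical placeholders, not part of any example in this post.

```r
# A minimal sketch, assuming a hypothetical data frame `my_data`
# with a binary outcome `y` and predictors `x1` and `x2`

# Fit the model of interest and an intercept-only (null) model
fit <- glm(y ~ x1 + x2, data = my_data, family = binomial())
fit_null <- glm(y ~ 1, data = my_data, family = binomial())

# McFadden's pseudo-R2: 1 minus the ratio of the log-likelihoods
pseudo_r2 <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit_null))
pseudo_r2
```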
Probably you think that R is hard to use, and that its transparency comes with that additional difficulty. But if you use packages created by the community, such as the superb tidyverse, you’ll have access to functions that do not come with R when you install it but that ease your R work a lot (think about estout in Stata!).
But what about other packages? Sometimes it’s hard to find step-by-step tutorials of the form ‘In Stata you do X, in R you do Y’. Here I’ll explain a bit of econometrics using R and how to create a collection of packages.
Think about R packages as browser extensions: they serve a particular purpose for a particular group of users. For example, not all Firefox/Chrome users need the Enhancer for YouTube extension, as not all users watch YouTube. R’s default installation is meant to be light and for general usage.
You might have found Stack Overflow and the #rstats hashtag on Twitter, where there are people who help a lot, and there are nice communities such as rOpenSci. But since R has a very large (and growing) user base, don’t forget it’s a sample from the human race: you’ll find tutorials written in a language that’s not different from what the intersection of many cults and pseudo-science would produce (not as bad as crypto-bros, of course), and there will be people who reply ‘Google it’ when you have a problem that you can’t figure out. Don’t ever get disappointed, please! Ask the right people.
Just send some shade to those who don’t contribute to a healthy community.
Fortunately, if you belong to an under-represented group or you are just starting with R, there’s an important fraction of the R community who are truly inclusive and whose self-esteem is not so low that they feel frightened by people who are different from them. Please count on the valuable ones, and if you see something, say something.
Installing R packages from CRAN (i.e. the official source)
CRAN, the official server for R packages, does a very good job at providing some packages such as
- dplyr: A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
- ggplot2: A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
- msm: Multi-State Markov and Hidden Markov Models in Continuous Time
But there are more packages outside CRAN, or that were removed from it due to policy changes, etc. One of these is RVAideMemoire, which provides ‘Testing and Plotting Procedures for Biostatistics’ and which is very helpful to work with generalized linear models.
dplyr and ggplot2 are largely covered on many other blogs. Both are part of the tidyverse, which is like a bag containing different packages.
You can install R packages with `install.packages("tidyverse")` or similar.
When you need to install from other sources, it gets more complicated.
I recommend installing packages from CRAN unless you need new functionality, because unofficial sources sometimes introduce experimental changes.
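For example, before reaching for a development version, you can compare the version you already have installed against the one published on CRAN. This is just a sketch, and it assumes the package is currently listed on CRAN.

```r
# Version currently installed on your machine
packageVersion("gravity")

# Version currently available on CRAN (assumes the package is listed there)
available.packages()["gravity", "Version"]
```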
Installing packages from unofficial sources (i.e. GitHub)
To ease package installation from sources other than CRAN, we have the remotes package. For example, the gravity package hasn’t been updated on CRAN due to some policy changes that prevent me from adding some improvements without taking detours. These detours are irrelevant for the end user, who can install the improved gravity version from the package repository.
The easy way would be:
install.packages("remotes") remotes::install_github("pachadotdev/gravity")
And for packages removed from CRAN it’s the same:
install.packages("remotes") remotes::install_github("pachadotdev/RVAideMemoire")
Of course, you can install from different repositories, such as `lrberge/fixest` instead of `install.packages("fixest")`. By the way, fixest is beyond excellent and very helpful for fitting models with multiple fixed effects.
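As a quick taste of its syntax, here is a minimal sketch of a fixed-effects estimation with fixest; `trade_data` and its columns are hypothetical placeholders, not part of the example developed below.

```r
library(fixest)

# A minimal sketch, assuming a hypothetical data frame `trade_data`
# with columns trade, dist, exporter and importer.
# Everything after `|` is absorbed as fixed effects.
model_fe <- feols(
  log(trade) ~ log(dist) | exporter + importer,
  data = trade_data
)

summary(model_fe)
```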
Fitting models with the fixest package vs using base R
At the beginning of the post I mentioned things that are not easy to do with standard R.
As an example, say we need to replicate the first OLS model from page 42 in An Advanced Guide to Trade Policy Analysis (see the solutions manual).
The model specification is: \[\begin{align*} \log X_{ij,t} =& \:\beta_0 + \beta_1 \log DIST_{i,j} + \beta_2 CNTG_{i,j} + \beta_3 LANG_{i,j} + \beta_4 CLNY_{i,j} + \beta_5 \log Y_{i,t} +\\ & \:\beta_6 \log E_{j,t} + \varepsilon_{ij,t} \end{align*}\]
where all the variables, except \(Y,E\) are in the dataset included within the package
```
> tradepolicy::agtpa_applications
# A tibble: 99,981 x 17
   exporter importer pair_id  year   trade   dist  cntg  lang  clny   rta rta_lag3 rta_lag4 rta_lag6 rta_lag8
   <chr>    <chr>      <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 ARG      ARG        12339  1986 61289.    534.     0     0     0     0        0        0        0        0
 2 ARG      AUS            1  1986    27.8  12045.    0     0     0     0        0        0        0        0
 3 ARG      AUT            2  1986     3.56 11751.    0     0     0     0        0        0        0        0
 4 ARG      BEL            4  1986    96.1  11305.    0     0     0     0        0        0        0        0
 5 ARG      BGR            3  1986     3.13 12116.    0     0     0     0        0        0        0        0
 6 ARG      BOL            6  1986    52.7   1866.    1     1     0     0        0        0        0        0
 7 ARG      BRA            8  1986   405.    2392.    1     0     0     0        0        0        0        1
 8 ARG      CAN           10  1986    48.3   9391.    0     0     0     0        0        0        0        0
 9 ARG      CHE           12  1986    23.6  11233.    0     0     0     0        0        0        0        0
10 ARG      CHL           14  1986   109.    1157.    1     1     0     0        0        0        0        1
# … with 99,971 more rows, and 3 more variables: rta_lag9 <dbl>, rta_lag12 <dbl>, rta_lead4 <dbl>
```
What one would do to replicate the results is to follow the steps from the book but in R:
- Load the tradepolicy package to have the book’s data
- Filter observations for a range of years (1986, 1990, 1994, 1998, 2002 and 2006)
- Transform some variables to logarithm scale (trade and dist) and create new variables from those in the original dataset
- Remove cases where the exporter and the importer are the same
- Drop observations where the trade flow is zero
The first point consists of:

```r
install.packages("tradepolicy")
library(tradepolicy)
```
The tradepolicy package also loads dplyr and other packages when it is attached, and dplyr provides the `%>%` operator to chain operations. To complete point 2 with dplyr (much easier than in base R) you can do this:
```r
ch1_application1 <- agtpa_applications %>%
  filter(year %in% seq(1986, 2006, 4))
```
Then step 3 is similar (see the book for the definition of \(Y,E\)):
```r
ch1_application1 <- ch1_application1 %>%
  mutate(
    log_trade = log(trade),
    log_dist = log(dist)
  )

ch1_application1 <- ch1_application1 %>%
  # Create Yit
  group_by(exporter, year) %>%
  mutate(
    y = sum(trade),
    log_y = log(y)
  ) %>%
  # Create Eit
  group_by(importer, year) %>%
  mutate(
    e = sum(trade),
    log_e = log(e)
  )
```
Steps 4 and 5 are more of the same:
```r
ch1_application1 <- ch1_application1 %>%
  filter(exporter != importer, trade > 0)
```
Now I’m ready to fit the model, but instead of following the steps from the solutions manual, I’ll use fixest rather than the default functions in tradepolicy:
```r
# fixest provides feols()
library(fixest)

model1 <- feols(
  log_trade ~ log_dist + cntg + lang + clny + log_y + log_e,
  data = ch1_application1,
  cluster = ~pair_id
)
```
As you can notice in the code, I specified a clustering variable for the standard errors. R has a default function for linear models, `lm`, but it requires additional packages to cluster the standard errors.
Let’s explore the model output:
```
> model1
OLS estimation, Dep. Var.: log_trade
Observations: 25,689
Standard-errors: Clustered (pair_id)
              Estimate Std. Error   t value  Pr(>|t|)    
(Intercept) -11.283000   0.295827  -38.1410 < 2.2e-16 ***
log_dist     -1.001600   0.027340  -36.6350 < 2.2e-16 ***
cntg          0.573805   0.184710    3.1065  0.001916 ** 
lang          0.801548   0.082102    9.7629 < 2.2e-16 ***
clny          0.734853   0.144193    5.0963  3.74e-07 ***
log_y         1.190200   0.009456  125.8700 < 2.2e-16 ***
log_e         0.907588   0.009910   91.5850 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.7427   Adj. R2: 0.758469
```
This is exactly the same result from the book!
Now let’s do it with `lm` and some reinforcements: sandwich for the clustered standard errors and lmtest for the coefficients’ tests:
install.packages(c("sandwich","lmtest")) library(sandwich) library(lmtest) model2 <- lm( log_trade ~ log_dist + cntg + lang + clny + log_y + log_e, data = ch1_application1 ) vcov_cluster <- sandwich::vcovCL( model2, cluster = ch1_application1 %>% filter(trade > 0) %>% ungroup() %>% select(pair_id) %>% pull() ) coef_test <- coeftest( model2, vcov_cluster[ rownames(vcov_cluster), rownames(vcov_cluster) ] )
Now we can see `coef_test`:
```
> coef_test

t test of coefficients:

               Estimate Std. Error  t value  Pr(>|t|)    
(Intercept) -11.2830801  0.2958274 -38.1408 < 2.2e-16 ***
log_dist     -1.0016075  0.0273400 -36.6353 < 2.2e-16 ***
cntg          0.5738051  0.1847098   3.1065  0.001895 ** 
lang          0.8015482  0.0821017   9.7629 < 2.2e-16 ***
clny          0.7348535  0.1441929   5.0963 3.488e-07 ***
log_y         1.1902357  0.0094560 125.8716 < 2.2e-16 ***
log_e         0.9075884  0.0099098  91.5846 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Of course, you can continue the example and build on top of the second model until you obtain a full summary like the one below, but with clustered standard errors reported instead:
```
> summary(model2)

Call:
lm(formula = log_trade ~ log_dist + cntg + lang + clny + log_y +
    log_e, data = ch1_application1 %>% filter(trade > 0))

Residuals:
     Min       1Q   Median       3Q      Max 
-14.5421  -0.8281   0.1578   1.0476   7.6585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.283080   0.151732  -74.36  < 2e-16 ***
log_dist     -1.001607   0.014159  -70.74  < 2e-16 ***
cntg          0.573805   0.074427    7.71 1.31e-14 ***
lang          0.801548   0.033748   23.75  < 2e-16 ***
clny          0.734853   0.070387   10.44  < 2e-16 ***
log_y         1.190236   0.005402  220.32  < 2e-16 ***
log_e         0.907588   0.005577  162.73  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.743 on 25682 degrees of freedom
Multiple R-squared:  0.7585,    Adjusted R-squared:  0.7585 
F-statistic: 1.345e+04 on 6 and 25682 DF,  p-value: < 2.2e-16
```
As you can see, fixest helps a lot, which is what packages do.
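If you miss esttab-style regression tables from Stata, fixest also provides the `etable()` function. The calls below are a minimal sketch using the model estimated above, and the output file name is a hypothetical placeholder.

```r
# Print a regression table for the fixest model estimated above
etable(model1)

# Or export it as a LaTeX table (the file name is a hypothetical placeholder)
etable(model1, tex = TRUE, file = "table1.tex")
```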
Using the R-Universe
The R-Universe, created by Jeroen Ooms, provides a very simple way to create personal CRAN-like repositories, which gives you a way to show the community the collection of tools you use.
In addition, you can use it to publish articles by using rmarkdown, an R package that allows you to write text and code together and generate a PDF document, even with the templates required by the journal where you want to publish.
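For instance, once you have written an R Markdown document, rendering it to PDF is a one-liner. This is just a sketch, and `my_article.Rmd` is a hypothetical file name.

```r
library(rmarkdown)

# Render a hypothetical R Markdown file to PDF
# (requires a LaTeX installation, e.g. via the tinytex package)
render("my_article.Rmd", output_format = "pdf_document")
```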
To install leontief or any other package from my R-Universe section, you can do this:
```r
# Install new leontief version
install.packages("leontief", repos = "https://pachadotdev.r-universe.dev")
```
Here are some universes that you can add permanently to your configuration:
```r
# Enable some universes
options(repos = c(
  pachadotdev = 'https://pachadotdev.r-universe.dev',
  tidyverse = 'https://tidyverse.r-universe.dev',
  rlib = 'https://r-lib.r-universe.dev',
  tidymodels = 'https://tidymodels.r-universe.dev',
  rspatial = 'https://r-spatial.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'
))
```
Creating your own R-Universe
This is particularly useful if you teach courses and provide, for example, data packages for your students, or if you have packages that you don’t intend to send to CRAN.
In order to join the R-Universe, you need a GitHub profile, and a very good reference to start with Git and GitHub is Happy Git and GitHub for the useR.
For example, tradepolicy is a repository where I keep all the code to reproduce the results from An Advanced Guide to Trade Policy Analysis. I’m interested in listing the R packages used there, and in other repositories, so that other users can easily discover the tools I use on the R-Universe.
When you visit the R-Universe you’ll see this landing page, where you have to click ‘Setup’.
Then you have to select your personal profile (or an organizational account if you have authorization).
Now you can choose all your repositories or just a few. I picked gravity, agtpa and a few others, and then clicked ‘Install’.
You’ll be asked to confirm R-Universe access to your repositories.
Once it’s ready, you’ll see this
And the URL of your repository will be of the form https://githubusername.r-universe.dev/ (https://pachadotdev.r-universe.dev/ in my case).
A few minutes after setting it up, the site shall look like this:
Please notice that the repositories I’ve added are R packages. You can also add just a list of packages you use (i.e. ‘I use dplyr, haven, etc.’) to the R-Universe if you create a repository with a `packages.json` file.
This is how to add a list of your favourite packages to the R-Universe:
- Go to github.com and create a new repository
- Name your repository "r-universe" (or any other name), set it as public, pick a suitable license (I like the Apache license), and proceed with the following steps
- Create a new file `packages.json` containing your packages
Here’s the template text that I used:
[ { "package": "tradepolicy", "url": "https://github.com/pachadotdev/tradepolicy" }, { "package": "leontief", "url": "https://github.com/pachadotdev/leontief" }, { "package": "tradestatistics", "url": "https://github.com/ropensci/tradestatistics" }, { "package": "gravity", "url": "https://github.com/pachadotdev/gravity" }, { "package": "arrow", "url": "https://github.com/apache/arrow" }, { "package": "RVAideMemoire", "url": "https://github.com/pachadotdev/RVAideMemoire" } ]
- If you already added a few repositories to the R-Universe, also add the newly created r-universe repo; otherwise, you are ready