How to stop worrying and start using R packages efficiently for Econometrics
Dedicated to all those who write stupid comments about people with disabilities, LGBT+ people, non-white people and women, and especially to those who promote false inclusion. As always, this article is a personal opinion and does not involve the people who work with me unless they explicitly agree with these words.
Context
When you do econometrics, using R is not immediately straightforward, and for most users it takes time to learn to use it the way they would use Stata or SPSS. R has a steep learning curve, but we can flatten it with a few additions to the vanilla R setup.
If you are just trying R and you come from Stata, you might already have been shocked by at least one of these (it happened to me back in 2015):
- It is not easy to compute clustered standard errors for linear models
- Obtaining a pseudo-R2 for generalized linear models needs additional work (see the sketch after this list)
- There are different packages for similar usages
- Not all packages are compatible with each other
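As a taste of the extra work involved, here is a minimal sketch of McFadden's pseudo-R2 for a GLM, computed by hand from the log-likelihoods. The data frame `my_data` and its columns are hypothetical placeholders, not part of any example in this post.

```r
# A minimal sketch, assuming a hypothetical data frame `my_data`
# with a binary outcome `y` and predictors `x1` and `x2`

# Fit the model of interest and an intercept-only (null) model
fit <- glm(y ~ x1 + x2, data = my_data, family = binomial())
fit_null <- glm(y ~ 1, data = my_data, family = binomial())

# McFadden's pseudo-R2: 1 minus the ratio of the log-likelihoods
pseudo_r2 <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit_null))
pseudo_r2
```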
Probably you think that R is hard to use, and that its transparency comes with that additional difficulty. But if you use packages created by the community, such as the superb tidyverse, you’ll have access to functions that do not come with R when you install it but that ease your R work a lot (think about estout in Stata!).
But what about other packages? Sometimes it’s hard to find step-by-step tutorials of the form ‘In Stata you do X, in R you do Y’. Here I’ll explain a bit of econometrics using R and how to create a collection of packages.
Think about R packages as browser extensions: they serve a particular purpose for a particular group of users. For example, not all Firefox/Chrome users need the Enhancer for YouTube extension, as not all users watch YouTube. R’s default installation is meant to be light and for general usage.
You might have found Stack Overflow and the #rstats hashtag on Twitter, where there are people who help a lot, and there are nice communities such as rOpenSci. But since R has a very large (and growing) user base, don’t forget it’s a sample from the human race: you’ll find tutorials written in a language that’s not different from what the intersection of many cults and pseudo-science would produce (not as bad as crypto-bros, of course), and there will be people who reply ‘Google it’ when you have a problem that you can’t figure out. Don’t ever get disappointed, please! Ask the right people.
Just send some shade to those who don’t contribute to a healthy community.
Fortunately, if you belong to an under-represented group or you are just starting with R, there’s an important fraction of the R community who are truly inclusive and whose self-esteem is not so low that they feel frightened by people who are different from them. Please count on the valuable ones, and if you see something, say something.
Installing R packages from CRAN (i.e. the official source)
CRAN, the official server for R packages, does a very good job at providing some packages such as
- dplyr: A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
- ggplot2: A system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”.
- msm: Multi-State Markov and Hidden Markov Models in Continuous Time
But there are more packages outside CRAN, or that were removed from it due to policy changes, etc. One of these is RVAideMemoire, which provides ‘Testing and Plotting Procedures for Biostatistics’ and which is very helpful to work with generalized linear models.
dplyr and ggplot2 are largely covered on many other blogs. Both are part of the tidyverse, which is like a bag containing different packages.
You can install R packages with `install.packages("tidyverse")` or similar.
When you need to install from other sources, it gets more complicated.
I recommend installing packages from CRAN unless you need new functionality, because unofficial sources sometimes introduce experimental changes.
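For example, before reaching for a development version, you can compare the version you already have installed against the one published on CRAN. This is just a sketch, and it assumes the package is currently listed on CRAN.

```r
# Version currently installed on your machine
packageVersion("gravity")

# Version currently available on CRAN (assumes the package is listed there)
available.packages()["gravity", "Version"]
```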
Installing packages from unofficial sources (i.e. GitHub)
To ease package installation from sources other than CRAN, we have the remotes package. For example, the gravity package hasn’t been updated on CRAN due to some policy changes that prevent me from adding some improvements without taking detours. These detours are irrelevant for the end user, who can install the improved gravity version from the package repository.
The easy way would be:
install.packages("remotes") remotes::install_github("pachadotdev/gravity")
And for packages removed from CRAN it’s the same:
install.packages("remotes") remotes::install_github("pachadotdev/RVAideMemoire")
Of course, you can install from different repositories, such as `lrberge/fixest` instead of `install.packages("fixest")`. By the way, fixest is beyond excellent and very helpful for fitting models with multiple fixed effects.
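As a quick taste of its syntax, here is a minimal sketch of a fixed-effects estimation with fixest; `trade_data` and its columns are hypothetical placeholders, not part of the example developed below.

```r
library(fixest)

# A minimal sketch, assuming a hypothetical data frame `trade_data`
# with columns trade, dist, exporter and importer.
# Everything after `|` is absorbed as fixed effects.
model_fe <- feols(
  log(trade) ~ log(dist) | exporter + importer,
  data = trade_data
)

summary(model_fe)
```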
Fitting models with the fixest package vs using base R
At the beginning of the post I mentioned things that are not easy to do with standard R.
As an example, say we need to replicate the first OLS model from page 42 in An Advanced Guide to Trade Policy Analysis (see the solutions manual).
The model specification is: \[\begin{align*} \log X_{ij,t} =& \:\beta_0 + \beta_1 \log DIST_{i,j} + \beta_2 CNTG_{i,j} + \beta_3 LANG_{i,j} + \beta_4 CLNY_{i,j} + \beta_5 \log Y_{i,t} +\\ & \:\beta_6 \log E_{j,t} + \varepsilon_{ij,t} \end{align*}\]
where all the variables, except \(Y,E\) are in the dataset included within the package
```
> tradepolicy::agtpa_applications
# A tibble: 99,981 x 17
   exporter importer pair_id  year   trade   dist  cntg  lang  clny   rta rta_lag3 rta_lag4 rta_lag6 rta_lag8
   <chr>    <chr>      <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 ARG      ARG        12339  1986 61289.    534.     0     0     0     0        0        0        0        0
 2 ARG      AUS            1  1986    27.8  12045.    0     0     0     0        0        0        0        0
 3 ARG      AUT            2  1986     3.56 11751.    0     0     0     0        0        0        0        0
 4 ARG      BEL            4  1986    96.1  11305.    0     0     0     0        0        0        0        0
 5 ARG      BGR            3  1986     3.13 12116.    0     0     0     0        0        0        0        0
 6 ARG      BOL            6  1986    52.7   1866.    1     1     0     0        0        0        0        0
 7 ARG      BRA            8  1986   405.    2392.    1     0     0     0        0        0        0        1
 8 ARG      CAN           10  1986    48.3   9391.    0     0     0     0        0        0        0        0
 9 ARG      CHE           12  1986    23.6  11233.    0     0     0     0        0        0        0        0
10 ARG      CHL           14  1986   109.    1157.    1     1     0     0        0        0        0        1
# … with 99,971 more rows, and 3 more variables: rta_lag9 <dbl>, rta_lag12 <dbl>, rta_lead4 <dbl>
```
What one would do to replicate the results is to follow the steps from the book but in R:
- Load the tradepolicy package to have the book’s data
- Filter observations for a range of years (1986, 1990, 1994, 1998, 2002 and 2006)
- Transform some variables to logarithm scale (trade and dist) and create new variables from those in the original dataset
- Remove cases where the exporter and the importer are the same
- Drop observations where the trade flow is zero
The first point consists of:

```r
install.packages("tradepolicy")
library(tradepolicy)
```
The tradepolicy package also loads dplyr and other packages when it is attached, and dplyr provides the `%>%` operator to chain operations. To complete point 2 with dplyr (much easier than in base R) you can do this:
```r
ch1_application1 <- agtpa_applications %>%
  filter(year %in% seq(1986, 2006, 4))
```
Then step 3 is similar (see the book for the definition of \(Y,E\)):
```r
ch1_application1 <- ch1_application1 %>%
  mutate(
    log_trade = log(trade),
    log_dist = log(dist)
  )

ch1_application1 <- ch1_application1 %>%
  # Create Yit
  group_by(exporter, year) %>%
  mutate(
    y = sum(trade),
    log_y = log(y)
  ) %>%
  # Create Eit
  group_by(importer, year) %>%
  mutate(
    e = sum(trade),
    log_e = log(e)
  )
```
Steps 4 and 5 are more of the same:
```r
ch1_application1 <- ch1_application1 %>%
  filter(exporter != importer, trade > 0)
```
Now I’m ready to fit the model, but instead of following the steps from the solutions manual, I’ll use fixest rather than the default functions in tradepolicy:
```r
# fixest provides feols()
library(fixest)

model1 <- feols(
  log_trade ~ log_dist + cntg + lang + clny + log_y + log_e,
  data = ch1_application1,
  cluster = ~pair_id
)
```
As you can notice in the code, I specified a clustering variable for the standard errors. R has a default function for linear models, `lm`, but it requires additional packages to cluster the standard errors.
Let’s explore the model output:
```
> model1
OLS estimation, Dep. Var.: log_trade
Observations: 25,689
Standard-errors: Clustered (pair_id)
              Estimate Std. Error   t value  Pr(>|t|)    
(Intercept) -11.283000   0.295827  -38.1410 < 2.2e-16 ***
log_dist     -1.001600   0.027340  -36.6350 < 2.2e-16 ***
cntg          0.573805   0.184710    3.1065  0.001916 ** 
lang          0.801548   0.082102    9.7629 < 2.2e-16 ***
clny          0.734853   0.144193    5.0963  3.74e-07 ***
log_y         1.190200   0.009456  125.8700 < 2.2e-16 ***
log_e         0.907588   0.009910   91.5850 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.7427   Adj. R2: 0.758469
```
This is exactly the same result from the book!
Now let’s do it with `lm` and some reinforcements: sandwich for the clustered standard errors and lmtest for the coefficients’ tests:
install.packages(c("sandwich","lmtest")) library(sandwich) library(lmtest) model2 <- lm( log_trade ~ log_dist + cntg + lang + clny + log_y + log_e, data = ch1_application1 ) vcov_cluster <- sandwich::vcovCL( model2, cluster = ch1_application1 %>% filter(trade > 0) %>% ungroup() %>% select(pair_id) %>% pull() ) coef_test <- coeftest( model2, vcov_cluster[ rownames(vcov_cluster), rownames(vcov_cluster) ] )
Now we can see `coef_test`:
```
> coef_test

t test of coefficients:

               Estimate Std. Error  t value  Pr(>|t|)    
(Intercept) -11.2830801  0.2958274 -38.1408 < 2.2e-16 ***
log_dist     -1.0016075  0.0273400 -36.6353 < 2.2e-16 ***
cntg          0.5738051  0.1847098   3.1065  0.001895 ** 
lang          0.8015482  0.0821017   9.7629 < 2.2e-16 ***
clny          0.7348535  0.1441929   5.0963 3.488e-07 ***
log_y         1.1902357  0.0094560 125.8716 < 2.2e-16 ***
log_e         0.9075884  0.0099098  91.5846 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Of course, you can continue the example and build on top of the second model until you obtain a full summary like the one below, but with clustered standard errors reported instead:
```
> summary(model2)

Call:
lm(formula = log_trade ~ log_dist + cntg + lang + clny + log_y +
    log_e, data = ch1_application1 %>% filter(trade > 0))

Residuals:
     Min       1Q   Median       3Q      Max 
-14.5421  -0.8281   0.1578   1.0476   7.6585 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.283080   0.151732  -74.36  < 2e-16 ***
log_dist     -1.001607   0.014159  -70.74  < 2e-16 ***
cntg          0.573805   0.074427    7.71 1.31e-14 ***
lang          0.801548   0.033748   23.75  < 2e-16 ***
clny          0.734853   0.070387   10.44  < 2e-16 ***
log_y         1.190236   0.005402  220.32  < 2e-16 ***
log_e         0.907588   0.005577  162.73  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.743 on 25682 degrees of freedom
Multiple R-squared:  0.7585,    Adjusted R-squared:  0.7585 
F-statistic: 1.345e+04 on 6 and 25682 DF,  p-value: < 2.2e-16
```
As you can see, fixest helps a lot, which is what packages do.
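If you miss esttab-style regression tables from Stata, fixest also provides the `etable()` function. The calls below are a minimal sketch using the model estimated above, and the output file name is a hypothetical placeholder.

```r
# Print a regression table for the fixest model estimated above
etable(model1)

# Or export it as a LaTeX table (the file name is a hypothetical placeholder)
etable(model1, tex = TRUE, file = "table1.tex")
```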
Using the R-Universe
The R-Universe, created by Jeroen Ooms, provides a very simple way to create personal CRAN-like repositories, which gives you a way to show the community the collection of tools you use.
In addition, you can use it to publish articles by using rmarkdown, an R package that allows you to write text and code together and generate a PDF document, even with the templates required by the journal where you want to publish.
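For instance, once you have written an R Markdown document, rendering it to PDF is a one-liner. This is just a sketch, and `my_article.Rmd` is a hypothetical file name.

```r
library(rmarkdown)

# Render a hypothetical R Markdown file to PDF
# (requires a LaTeX installation, e.g. via the tinytex package)
render("my_article.Rmd", output_format = "pdf_document")
```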
To install leontief or any other package from my R-Universe section, you can do this:
```r
# Install new leontief version
install.packages("leontief", repos = "https://pachadotdev.r-universe.dev")
```
Here are some universes that you can add permanently to your configuration:
```r
# Enable some universes
options(repos = c(
  pachadotdev = 'https://pachadotdev.r-universe.dev',
  tidyverse = 'https://tidyverse.r-universe.dev',
  rlib = 'https://r-lib.r-universe.dev',
  tidymodels = 'https://tidymodels.r-universe.dev',
  rspatial = 'https://r-spatial.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'
))
```
Creating your own R-Universe
This is particularly useful if you teach courses and provide, for example, data packages for your students, or if you have packages that you don’t intend to send to CRAN.
In order to join the R-Universe, you need a GitHub profile, and a very good reference to start with Git and GitHub is Happy Git and GitHub for the useR.
For example, tradepolicy is a repository where I keep all the code to reproduce the results from An Advanced Guide to Trade Policy Analysis. I’m interested in listing the R packages used there, and in other repositories, so that other users can easily discover the tools I use on the R-Universe.
When you visit the R-Universe you’ll see this landing page, where you have to click ‘Setup’.
Then you have to select your personal profile (or an organizational account if you have authorization).
Now you can choose all your repositories or just a few. I picked gravity, agtpa and a few others, and then clicked ‘Install’.
You’ll be asked to confirm R-Universe access to your repositories.
Once it’s ready, you’ll see this
And the URL of your repository will be of the form https://githubusername.r-universe.dev/ (https://pachadotdev.r-universe.dev/ in my case).
A few minutes after setting it up, the site shall look like this:
Please notice that the repositories I’ve added are R packages. You can also add just a list of packages you use (i.e. ‘I use dplyr, haven, etc.’) to the R-Universe if you create a repository with a `packages.json` file.
This is how to add a list of your favourite packages to the R-Universe:
- Go to github.com and create a new repository
- Name your repository "r-universe" (or any other name), set it as public, pick a suitable license (I like the Apache license), and proceed with the following steps
- Create a new file `packages.json` containing your packages
Here’s the template text that I used:
[ { "package": "tradepolicy", "url": "https://github.com/pachadotdev/tradepolicy" }, { "package": "leontief", "url": "https://github.com/pachadotdev/leontief" }, { "package": "tradestatistics", "url": "https://github.com/ropensci/tradestatistics" }, { "package": "gravity", "url": "https://github.com/pachadotdev/gravity" }, { "package": "arrow", "url": "https://github.com/apache/arrow" }, { "package": "RVAideMemoire", "url": "https://github.com/pachadotdev/RVAideMemoire" } ]
- If you already added a few repositories to the R-Universe, also add the newly created r-universe repo; otherwise, you are ready