
My Personal R Package

[This article was first published on R on Luke DiMartino, and kindly contributed to R-bloggers.]

The semester is over, so what better time to work on my personal R package, ladtools? Most of the functions are simple wrappers that improve my workflow.

Find the package on my GitHub here.

geom_lm() and scatter()

I fit a lot of linear regressions, and I do not enjoy the base R plot() and abline() syntax for quick visualization. So instead, I built two functions, geom_lm() and scatter().

geom_lm() is a wrapper for geom_smooth() with nicer defaults. Instead of fitting a LOESS model, it fits simple OLS, does not plot standard errors, and does not print that pesky message when the formula is not declared.

ggplot(midwest) +
  aes(x = percollege, y = percbelowpoverty) +
  geom_point() +
  geom_smooth() +
  theme_blog()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

That message annoys me to no end.

Instead,

ggplot(midwest) +
  aes(x = percollege, y = percbelowpoverty) +
  geom_point() +
  geom_lm() +
  theme_blog()
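The ladtools source is not shown in this post, so the following is only a sketch of how a wrapper with these defaults could be written. The name geom_lm_sketch() and its arguments are assumptions for illustration, not the package's actual code.

```r
library(ggplot2)

# Hypothetical stand-in for geom_lm(): geom_smooth() with OLS defaults.
# Declaring the formula explicitly is what silences the console message.
geom_lm_sketch <- function(..., formula = y ~ x, se = FALSE) {
  geom_smooth(..., method = "lm", formula = formula, se = se)
}

p <- ggplot(midwest) +
  aes(x = percollege, y = percbelowpoverty) +
  geom_point() +
  geom_lm_sketch()
```

Because the extra arguments pass through `...`, a call such as geom_lm_sketch(colour = "red") should still reach geom_smooth() unchanged.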

scatter() replicates the most frequent use case of Stata’s scatter command: quickly plotting data with a linear trend line through it. scatter() is just ggplot(), geom_point(), and geom_lm() combined into one call.

scatter(midwest, percbelowpoverty, percollege) +
    theme_blog()
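One way such a function could be put together is sketched below. The name scatter_sketch() is hypothetical, and the argument order (data, then y, then x) is inferred from the call above and from Stata's `scatter yvar xvar` convention rather than taken from the package source.

```r
library(ggplot2)

# Hypothetical stand-in for scatter(): points plus an OLS line in one call.
# The {{ }} operator forwards unquoted column names into aes().
scatter_sketch <- function(data, y, x) {
  ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
}

p <- scatter_sketch(midwest, percbelowpoverty, percollege)
```

The result is an ordinary ggplot object, so themes and labels can be added to it with `+` as usual.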

is_increasing() and is_decreasing()

These functions are wrappers meant to increase code readability.[1] They are mostly self-explanatory. The strictly parameter governs whether repeated values count as increasing/decreasing or not. It defaults to FALSE, allowing repeated values.

vec <- c(1, 1, 2, 3)

is_increasing(vec)
## [1] TRUE
is_increasing(vec, strictly = TRUE)
## [1] FALSE
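The footnote mentions is.unsorted(), so wrappers of this kind could plausibly look like the sketch below. The _sketch names are hypothetical and this is not the ladtools source.

```r
# Hypothetical wrappers around base::is.unsorted().
# With strictly = TRUE, repeated values count as "unsorted".
is_increasing_sketch <- function(x, strictly = FALSE) {
  !is.unsorted(x, strictly = strictly)
}

# A vector is decreasing when its reverse is increasing.
is_decreasing_sketch <- function(x, strictly = FALSE) {
  !is.unsorted(rev(x), strictly = strictly)
}

vec <- c(1, 1, 2, 3)
is_increasing_sketch(vec)
is_increasing_sketch(vec, strictly = TRUE)
is_decreasing_sketch(rev(vec))
```

Note that is.unsorted() returns NA when the input contains missing values, so these sketches would return NA in that case as well.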

calculate_outlier_value()

In many introductory statistics classes, students are taught that values more than 3 standard deviations from the mean, or more than 1.5 times the interquartile range above the 75th percentile or below the 25th percentile, are outliers. This test disappears quickly in most statistics curricula (for good reason), but I find it useful to understand the tails of my data, to rapidly test for influential points, and to adjust visualizations.
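As a rough sketch of the two rules, the cutoffs can be computed as below. The function name outlier_bounds_sketch() and its arguments are hypothetical; the signature of calculate_outlier_value() is not shown in this post.

```r
# Hypothetical helper: lower and upper outlier cutoffs under either rule.
outlier_bounds_sketch <- function(x, method = c("iqr", "sd"), na.rm = TRUE) {
  method <- match.arg(method)
  if (method == "iqr") {
    # 25th and 75th percentiles, then 1.5 * IQR beyond each
    q <- stats::quantile(x, c(0.25, 0.75), na.rm = na.rm, names = FALSE)
    iqr <- q[2] - q[1]
    c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
  } else {
    # 3 standard deviations on either side of the mean
    m <- mean(x, na.rm = na.rm)
    s <- stats::sd(x, na.rm = na.rm)
    c(lower = m - 3 * s, upper = m + 3 * s)
  }
}

x <- c(1:9, 100)
bounds <- outlier_bounds_sketch(x)

# Set values outside the cutoffs to NA
trimmed <- ifelse(x < bounds["lower"] | x > bounds["upper"], NA, x)
```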

Boxplots are an excellent visual tool for seeing how far outliers lie from the rest of the data. With larger \(n\), that method begins to fail. For a quick diagnosis, trimming outliers is quite convenient.

Influential point diagnostics exist for many models as well, often involving refitting the model without the outlying point. Trimming does this for all outlying points at once, giving a first look at the impact of influential points.
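A quick version of that comparison might look like the sketch below. It uses boxplot.stats() from base R to flag points beyond the boxplot whiskers, a close variant of the 1.5 times IQR rule, and mtcars is chosen here purely for illustration.

```r
# Flag horsepower values beyond the boxplot whiskers
out <- boxplot.stats(mtcars$hp)$out
trimmed <- mtcars[!mtcars$hp %in% out, ]

# Fit the same model with and without the flagged rows
fit_full <- lm(mpg ~ hp, data = mtcars)
fit_trim <- lm(mpg ~ hp, data = trimmed)

# Side-by-side coefficients give a first impression of influence
rbind(full = coef(fit_full), trimmed = coef(fit_trim))
```

A large shift in the slope between the two rows would suggest the flagged points deserve a closer look with proper diagnostics.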

Visualizations also run into problems with outliers, especially with gradient color scales. One outlier can dramatically stretch the scale, compressing the differences across most of the distribution. Filtering outliers or setting them to NA is a shortcut that sacrifices little integrity in order to visualize the majority of the distribution properly.
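To illustrate the color-scale case, here is a sketch that sets flagged values to NA before mapping them to color. The choice of midwest$poptotal, the column name poptotal_trimmed, and the use of boxplot.stats() are all assumptions made for this example.

```r
library(ggplot2)

# Flag county populations beyond the boxplot whiskers
out <- boxplot.stats(midwest$poptotal)$out

plot_data <- midwest
plot_data$poptotal_trimmed <- ifelse(
  plot_data$poptotal %in% out, NA, plot_data$poptotal
)

# Flagged counties are drawn in a neutral gray instead of stretching the scale
p <- ggplot(plot_data) +
  aes(x = percollege, y = percbelowpoverty, color = poptotal_trimmed) +
  geom_point() +
  scale_color_gradient(na.value = "grey80")
```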

theme_blog()

Standard ggplot2 visualizations look decent, but anyone publishing graphs for a website or organization should do better. After thousands of graphs, the gray background looks a bit dated, and who decided the standard should be Arial?

Here’s a standard ggplot2 graph:

The theme_bw() theme included with ggplot2 cleans up the image:
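The data behind these two figures is not stated, so the sketch below assumes the mtcars scatterplot used at the end of this post to draw a comparable pair.

```r
library(ggplot2)

base_plot <- ggplot(mtcars) +
  aes(x = hp, y = mpg) +
  geom_point()

# Default look: gray panel background from theme_gray()
default_plot <- base_plot

# Same plot with the built-in black-and-white theme
bw_plot <- base_plot + theme_bw()
```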

Custom themes are best built on top of a prior theme:

theme_blog <- function() {
  theme_bw(base_size = 11, base_family = "Verdana") %+replace%

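The %+replace% operator applies the theme() call that follows on top of theme_bw(), replacing each listed element wholesale rather than merging it with the theme_bw() version. The font now matches the site.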
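The difference between + and %+replace% can be seen in a small sketch. The axis.text element is used here only as an example, and the objects merged and replaced are hypothetical names.

```r
library(ggplot2)

new_element <- theme(axis.text = element_text(size = 9))

# `+` merges: properties not mentioned keep their theme_bw() values
merged <- theme_bw() + new_element

# `%+replace%` swaps the whole element: unmentioned properties become NULL
replaced <- theme_bw() %+replace% new_element

merged$axis.text$colour
replaced$axis.text$colour
```

When the plot is drawn, properties left NULL this way are filled in from the parent element (for axis.text, the root text element).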

Next, theme() arguments specify aspects of the theme. Custom themes are intimidating at first because they are verbose and isolated: they rely on little outside ggplot2. However, a detailed custom theme requires knowing only four functions of the element_ family (element_blank(), element_rect(), element_line(), and element_text()) plus margin() to control margins.
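As a compact illustration of those five functions, each appears once in the sketch below. The object mini_theme and the specific values are made up for this example, and the linewidth argument assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

mini_theme <- theme(
  # element_blank(): draw nothing
  panel.grid.minor = element_blank(),
  # element_rect(): backgrounds and borders
  panel.background = element_rect(fill = "white", colour = NA),
  # element_line(): grid lines, axis lines, ticks
  panel.grid.major = element_line(colour = "grey85", linewidth = 0.3),
  # element_text() with margin(): titles and labels with spacing
  plot.title = element_text(face = "bold", margin = margin(t = 8, b = 5))
)

p <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  labs(title = "Element functions at a glance") +
  mini_theme
```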

First, I make everything behind the plot transparent:

    theme(
      # Make everything transparent
      panel.background = element_blank(),
      plot.background = element_rect(
        fill = "transparent",
        colour = NA
        ),
      legend.key = element_rect(
        fill = "transparent",
        colour = NA
        ),

Next, I eliminate tick marks because they are redundant with the panel grid lines that run across the entire plot. Without tick marks, the labels along the axes are a

      # Eliminate tick marks
      axis.ticks = element_blank(),

Next, I center and enlarge the title and subtitle.

      # Adjust text elements
      plot.title = element_text(
        size = 16,
        face = "bold",
        hjust = .5, # center align
        vjust = 1,
        margin = margin(t = 8, b = 5)
      ),
      plot.subtitle = element_text(
        size = 12,
        margin = margin(t = 1, b = 5)
      ),
      plot.caption = element_text(
        size = 8,
        hjust = 1
      ),

Since the tick marks are gone, the variable names on the axes and the axis labels need adjustment.

      axis.title = element_text(size = 10),
      axis.text = element_text(size = 9),
      axis.text.x = element_text(
        margin = margin(t = 1, b = 5)
      ),
      axis.text.y = element_text(
        margin = margin(r = .5, l = 5)
      ),

When the legend is positioned inside the plot, I like it to have a background. I override this setting fairly often.

      # Legend settings
      legend.background = element_rect(
        fill = "light gray",
        color = "black",
        linewidth = .3 # `size` is deprecated here as of ggplot2 3.4.0
        ),
      legend.title = element_text(size = 7),
      legend.text = element_text(
        size = 7,
        margin = margin(t = 0, b = 0)
      ),
      legend.key.size = unit(.65, "lines"),

I decided against minor grid lines because they make the plot too busy.

      # Remove minor grid lines
      panel.grid.minor = element_blank()
    )
}

And that’s it! Here is what the graph looks like with theme_blog():

ggplot(mtcars) +
    aes(x = hp, y = mpg) +
    geom_point(aes(color = factor(cyl))) +
    geom_lm() +
    labs(
        x = "Horsepower", 
        y = "Miles Per Gallon", 
        color = "No. of Cylinders",
        caption = "mtcars data",
        title = "A Basic Scatterplot",
        subtitle = "Greater Horsepower Corresponds with Lower Fuel Efficiency") +
    theme_blog() +
    theme(legend.position = c(.8, .8))


  1. In all honesty, I lost points in statistics classes using is.unsorted() because the grader did not understand what was happening. I wrote wrappers months ago and packaged them for convenience.
