Visualizing the relationship between multiple variables

kjytay

3 years ago

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Visualizing the relationship between multiple variables can get messy very quickly. This post is about how the ggpairs() function in the GGally package does this task, as well as my own method for visualizing pairwise relationships when all the variables are categorical.

For all the code in this post in one file, click here.

The GGally::ggpairs() function does a really good job of visualizing the pairwise relationship for a group of variables. Let’s demonstrate this on a small segment of the vehicles dataset from the fueleconomy package:

library(fueleconomy)
data(vehicles)
df <- vehicles[1:100, ]
str(df)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	100 obs. of  12 variables:
#     $ id   : int  27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
# $ make : chr  "AM General" "AM General" "AM General" "AM General" ...
# $ model: chr  "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
# $ year : int  1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
# $ class: chr  "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
# $ trans: chr  "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
# $ drive: chr  "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
# $ cyl  : int  4 4 6 6 4 6 6 4 4 6 ...
# $ displ: num  2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
# $ fuel : chr  "Regular" "Regular" "Regular" "Regular" ...
# $ hwy  : int  17 17 13 13 17 13 21 26 28 26 ...
# $ cty  : int  18 18 13 13 16 13 14 20 22 18 ...

Let’s see how GGally::ggpairs() visualizes relationships between quantitative variables:

library(GGally)
quant_df <- df[, c("cyl", "hwy", "cty")]
ggpairs(quant_df)

Along the diagonal, we see a density plot for each of the variables.
Below the diagonal, we see scatterplots for each pair of variables.
Above the diagonal, we see the (Pearson) correlation between each pair of variables.

The visualization changes a little when we have a mix of quantitative and categorical variables. Below, fuel is a categorical variable while hwy is a quantitative variable.

mixed_df <- df[, c("fuel", "hwy")]
ggpairs(mixed_df)

For a categorical variable on the diagonal, we see a barplot depicting the number of times each category appears.
In one of the corners (top-right), for each categorical value we have a boxplot for the quantitative variable.
In one of the corners (bottom-left), for each categorical value we have a histogram for the quantitative variable.

The only behavior for GGally::ggpairs() we haven’t observed yet is for a pair of categorical variables. In the code fragment below, all 3 variables are categorical:

cat_df <- df[, c("fuel", "make", "drive")]
ggpairs(cat_df)

For each pair of categorical variables, we have a barplot depicting the number of times each pair of categorical values appears.

You may have noticed that the plots above the diagonal are essentially transposes of the plot below the diagonal, and so they don’t really convey any more information. What follows below is my attempt to make the plots above the diagonal more useful. Instead of plotting the transpose barplot, I will plot a heatmap showing the relative proportion of observations having each pair of categorical values.

First, the scaffold for the plot. I will use the gridExtra package to put several ggplot2 objects together. The code below puts the same barplots below the diagonal, variable names along the diagonal, and empty canvases above the diagonal. (Notice that I need some tricks to make the barplots with the variables as strings, namely the use of aes_string() and as.formula() within facet_grid().)

library(gridExtra)
library(tidyverse)
grobs <- list()
idx <- 0
for (i in 1:ncol(cat_df)) {
    for (j in 1:ncol(cat_df)) {
        idx <- idx + 1
        
        # get feature names (note that i & j are reversed)
        x_feat <- names(cat_df)[j]
        y_feat <- names(cat_df)[i]
        
        if (i < j) {
            # empty canvas
            grobs[[idx]] <- ggplot() + theme_void()
        } else if (i == j) {
            # just the name of the variable
            label_df <- data.frame(x = -0, y = 0, label = x_feat)
            grobs[[idx]] <- ggplot(label_df, aes(x = x, y = y, label = label), 
                                   face = "bold", hjust = 0.5) +
                geom_text() +
                coord_cartesian(xlim = c(-1, 1), ylim = c(-1, 1)) +
                theme_void()
        }
        else {
            # 2-dimensional barplot
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + 
                geom_bar() +
                facet_grid(as.formula(paste(y_feat, "~ ."))) +
                theme(legend.position = "none", axis.title = element_blank())
        }
    }
}
grid.arrange(grobs = grobs, ncol = ncol(cat_df))

This is essentially showing the same information as GGally::ggpairs(). To add the frequency proportion heatmaps, replace the code in the (i < j) branch with the following:

# frequency proportion heatmap
# get frequency proportions
freq_df <- cat_df %>% 
    group_by_at(c(x_feat, y_feat)) %>%
    summarize(proportion = n() / nrow(cat_df)) %>% 
    ungroup()
            
# get all pairwise combinations of values
temp_df <- expand.grid(unique(cat_df[[x_feat]]), 
                       unique(cat_df[[y_feat]]))
names(temp_df) <- c(x_feat, y_feat)
            
# join to get frequency proportion
temp_df <- temp_df %>%
    left_join(freq_df, by = c(setNames(x_feat, x_feat),
                              setNames(y_feat, y_feat))) %>%
    replace_na(list(proportion = 0))
            
grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + 
    geom_tile(aes(fill = proportion)) +
    geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
    scale_fill_gradient(low = "white", high = "#007acc") +
    theme(legend.position = "none", axis.title = element_blank())

Notice that each heatmap has its own limits for the color scale. If you want to have the same color scale for all the plots, you can add limits = c(0, 1) to the scale_fill_gradient() layer of the plot.

The one thing we lose here over the GGally::ggpairs() version is the marginal barplot for each variable. This is easy to add but then we don’t really have a place to put the variable names. Replacing the code in the (i == j) branch with the following is one possible option.

# df for positioning the variable name
label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2, 
                       y = max(table(cat_df[[x_feat]])) / 2, label = x_feat)
# marginal barplot with variable name on top
grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
    geom_bar() +
    geom_label(data = label_df, aes(x = x, y = y, label = label),
               size = 5)

In this final version, we clean up some of the axes so that more of the plot space can be devoted to the plot itself, not the axis labels:

theme_update(legend.position = "none", axis.title = element_blank())

grobs <- list()
idx <- 0
for (i in 1:ncol(cat_df)) {
    for (j in 1:ncol(cat_df)) {
        idx <- idx + 1
        
        # get feature names (note that i & j are reversed)
        x_feat <- names(cat_df)[j]
        y_feat <- names(cat_df)[i]
        
        if (i < j) {
            # frequency proportion heatmap
            # get frequency proportions
            freq_df <- cat_df %>% 
                group_by_at(c(x_feat, y_feat)) %>%
                summarize(proportion = n() / nrow(cat_df)) %>% 
                ungroup()
            
            # get all pairwise combinations of values
            temp_df <- expand.grid(unique(cat_df[[x_feat]]), 
                                   unique(cat_df[[y_feat]]))
            names(temp_df) <- c(x_feat, y_feat)
            
            # join to get frequency proportion
            temp_df <- temp_df %>%
                left_join(freq_df, by = c(setNames(x_feat, x_feat),
                                          setNames(y_feat, y_feat))) %>%
                replace_na(list(proportion = 0))
            
            grobs[[idx]] <- ggplot(temp_df, aes_string(x = x_feat, y = y_feat)) + 
                geom_tile(aes(fill = proportion)) +
                geom_text(aes(label = sprintf("%0.2f", round(proportion, 2)))) +
                scale_fill_gradient(low = "white", high = "#007acc") +
                theme(axis.ticks = element_blank(), axis.text = element_blank())
        } else if (i == j) {
            # df for positioning the variable name
            label_df <- data.frame(x = 0.5 + length(unique(cat_df[[x_feat]])) / 2, 
                                   y = max(table(cat_df[[x_feat]])) / 2, label = x_feat)
            # marginal barplot with variable name on top
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) +
                geom_bar() +
                geom_label(data = label_df, aes(x = x, y = y, label = label),
                           size = 5)
        }
        else {
            # 2-dimensional barplot
            grobs[[idx]] <- ggplot(cat_df, aes_string(x = x_feat)) + 
                geom_bar() +
                facet_grid(as.formula(paste(y_feat, "~ ."))) +
                theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())
        }
    }
}
grid.arrange(grobs = grobs, ncol = ncol(cat_df))

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.