Analyzing Professional Sports Team Colors with R

[This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers.]

When working with the ggplot2 package, I often find myself playing around with colors for longer than I probably should. I think this is because I know that the right color scheme can greatly enhance the information that a plot conveys; conversely, a poorly matched palette can suppress the message of an otherwise good visualization.

With that said, I wanted to take a look at the presence of colors in the sports realm. I think some fun insight can be had from an exploration of the colors used by individual sports teams. Others have done interesting technical research on this topic, such as studying the possible effects of color on fan and player perception of teams.

Setup

Technical Notes

library("dplyr")
# library("teamcolors")
library("ggplot2")
# library("tidyr")
# library("tibble")
# library("purrr")
# library("stringr")
# library("stringi")
# library("nbastatR")
# library("UpSetR")
# library("factoextra")
# library("NbClust")
# library("corrr")
# library("viridis")
# library("igraph")
# library("ggraph")
# library("circlize")
# library("colorscience")

The data that I’ll use comes from the teamcolors R package, which itself is sourced from Jim Nielsen’s website for team colors. This data set provides color information for all teams from six professional sports leagues:

teamcolors::teamcolors %>% create_kable()
name                  league  primary  secondary  tertiary  quaternary
AFC Bournemouth       epl     #e62333  #000000    NA        NA
Anaheim Ducks         nhl     #010101  #a2aaad    #fc4c02   #85714d
Arizona Cardinals     nfl     #97233f  #000000    #ffb612   #a5acaf
Arizona Coyotes       nhl     #010101  #862633    #ddcba4   NA
Arizona Diamondbacks  mlb     #a71930  #000000    #e3d4ad   NA
Arsenal               epl     #ef0107  #023474    #9c824a   NA
Atlanta Braves        mlb     #ce1141  #13274f    NA        NA
Atlanta Falcons       nfl     #a71930  #000000    #a5acaf   #a30d2d
Atlanta Hawks         nba     #e13a3e  #c4d600    #061922   NA
Atlanta United FC     mls     #a29061  #80000b    #000000   NA
# of total rows: 165

Putting this data in a “tidy” format is rather straightforward. [1] [2]

colors_tidy <-
  teamcolors::teamcolors %>%
  tidyr::gather(ord, hex, -name, -league)
colors_tidy %>% create_kable()
name                  league  ord      hex
AFC Bournemouth       epl     primary  #e62333
Anaheim Ducks         nhl     primary  #010101
Arizona Cardinals     nfl     primary  #97233f
Arizona Coyotes       nhl     primary  #010101
Arizona Diamondbacks  mlb     primary  #a71930
Arsenal               epl     primary  #ef0107
Atlanta Braves        mlb     primary  #ce1141
Atlanta Falcons       nfl     primary  #a71930
Atlanta Hawks         nba     primary  #e13a3e
Atlanta United FC     mls     primary  #a29061
# of total rows: 660
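
As an aside, tidyr::gather() has since been superseded by tidyr::pivot_longer(). The sketch below assumes tidyr 1.0.0 or later and uses a hypothetical two-team data frame, toy, in place of teamcolors::teamcolors. It shows what I believe is the equivalent reshaping; it is not the code used for the results in this post.

```r
library(tidyr)

# Hypothetical stand-in for teamcolors::teamcolors
# (values copied from the first table in this post).
toy <- data.frame(
  name = c("Arsenal", "Atlanta Braves"),
  league = c("epl", "mlb"),
  primary = c("#ef0107", "#ce1141"),
  secondary = c("#023474", "#13274f"),
  tertiary = c("#9c824a", NA_character_),
  quaternary = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

# Every column except name and league becomes an (ord, hex) pair.
toy_tidy <- pivot_longer(
  toy,
  cols = -c(name, league),
  names_to = "ord",
  values_to = "hex"
)
toy_tidy
```

One difference to keep in mind: pivot_longer() keeps the rows of each team together, whereas gather() stacks all primary colors first, then all secondary colors, and so on.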

Exploration

To begin, here’s a visualization of all the colors in this data set. Not much significance can be extracted from this plot, but it’s still nice to have as a mechanism for getting familiar with the data.

Color Brightness

Note that there are quite a few teams without a full set of four colors (and some without a third or even second color).

colors_pct_nas <-
  colors_tidy %>%
  count(league, is_na = is.na(hex), sort = TRUE) %>%
  filter(is_na) %>%
  select(-is_na) %>%
  left_join(
    teamcolors::teamcolors %>%
      count(league, sort = TRUE) %>%
      rename(total = n) %>%
      mutate(total = as.integer(4 * total)),
    by = "league"
  ) %>%
  mutate(n_pct = 100 * n / total) %>% 
  # funs() is deprecated in recent dplyr releases; a formula lambda does the same job.
  mutate_if(is.numeric, ~ round(., 2))
colors_pct_nas %>% create_kable()
league n total n_pct
mlb 47 120 39.17
nhl 42 124 33.87
epl 34 80 42.50
mls 20 88 22.73
nba 19 120 15.83
nfl 2 128 1.56

Both the visualization and the tabulation indicate that MLB is missing the most colors in absolute terms (47 values), although the EPL has the largest share of missing values (42.5%). Perhaps this suggests that MLB is the most “dull” sports league. [3] The NFL is on the other end of the spectrum (pun intended), with only about 1.6% of its color values missing. Is it a coincidence that the NFL is the most popular sports league in the U.S.? [4]

My subjective indictment of MLB as dull is certainly unfair and unquantitative. Does “dull” refer to hue, lightness, brightness, etc.? For the sake of argument, let’s say that I want to interpret dullness in terms of “brightness”, which, in the color lexicon, is the arithmetic mean of the red-green-blue (RGB) values of a color. To rank the leagues by brightness, I can take the average of the RGB values (derived from the hex values) across all colors for all teams in each league. The resulting values (where a lower value indicates a darker color and a higher value a brighter one) provide a reasonable measure by which each league’s aggregate color choices can be judged.
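
To make the definition concrete, here is a minimal sketch of the brightness calculation for individual colors. The helper name brightness() is my own for this illustration; it is not part of the pipeline used for the league rankings.

```r
# Brightness of a hex color: the mean of its red, green, and blue values,
# rescaled from 0-255 to 0-1. `brightness()` is a hypothetical helper.
brightness <- function(hex) {
  rgb <- grDevices::col2rgb(hex)  # 3 x n matrix with rows red, green, blue
  colMeans(rgb) / 255
}

brightness(c("#000000", "#ffffff"))  # 0 for black, 1 for white
brightness("#e62333")                # (230 + 35 + 51) / 3 / 255, about 0.41
```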

But first, the reader should be aware of a couple more technicalities: [5]

add_rgb_cols <- function(data) {
  data %>%
    pull(hex) %>%
    grDevices::col2rgb() %>%
    t() %>%
    tibble::as_tibble() %>%
    bind_cols(data, .) 
}

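rank_leagues_byrgb <- function(data) {
  # Split each hex value into red, green, and blue columns (0-255).
  colors_rgb <-
    data %>%
    add_rgb_cols() %>%
    select(-hex) %>%
    arrange(league, name)

  # Per team: rescale the channels to [0, 1], average each channel over the
  # team's colors, then take the mean and sd across the three channels.
  # (Plain functions and lambdas replace the deprecated funs().)
  colors_rgb_bynm_bylg <-
    colors_rgb %>%
    mutate_at(vars(red, green, blue), ~ . / 255) %>%
    group_by(name, league) %>%
    summarize_at(vars(red, green, blue), mean) %>%
    ungroup() %>%
    tidyr::gather(rgb, value, red, green, blue) %>%
    group_by(name, league) %>%
    summarize_at(vars(value), list(mean = mean, sd = sd)) %>%
    ungroup() %>%
    arrange(league, mean)

  # Per league: average the team-level mean and sd; darkest league first.
  colors_rgb_bylg <-
    colors_rgb_bynm_bylg %>%
    group_by(league) %>%
    summarize_at(vars(mean, sd), mean) %>%
    ungroup() %>%
    arrange(mean)
  colors_rgb_bylg
}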

convert_dec2pct <- function(x) {
  100 * round(x, 4)
}

colors_tidy_nona <-
  colors_tidy %>%
  filter(!is.na(hex))


colors_tidy_nona %>% 
  rank_leagues_byrgb() %>%
  arrange(mean) %>% 
  mutate_if(is.numeric, convert_dec2pct) %>% 
  create_kable()
league mean sd
nhl 30.46 14.52
nfl 32.90 12.68
mlb 33.75 13.16
epl 36.56 15.79
mls 38.59 12.05
nba 40.99 10.37

This calculation supports what we might have guessed by inspection: the NHL, not MLB, has the darkest colors on average. In fact, the NHL’s “darkness” is most prominent in the primary colors of its teams.

colors_tidy_nona %>% 
  filter(ord == "primary") %>% 
  rank_leagues_byrgb() %>% 
  arrange(mean) %>% 
  mutate_if(is.numeric, convert_dec2pct) %>% 
  create_kable()
league mean sd
nhl 9.13 9.20
nfl 23.42 18.69
mlb 29.90 29.00
mls 32.16 22.74
epl 37.35 29.10
nba 37.95 30.10

On the other hand, the NBA and the two soccer leagues (the MLS and the EPL) stand out as the leagues with the “brightest” colors.

Finally, just by inspection, it seems like there is an unusual pattern where a disproportionate number of teams in the MLS, NBA, and NFL have shades of gray as their tertiary colors. Using the same function as before, this can be shown indirectly: the standard deviation across the red, green, and blue components is relatively small for these leagues, and nearly equal components are what make a color a gray.

colors_tidy_nona %>% 
  filter(ord == "tertiary") %>% 
  rank_leagues_byrgb() %>%
  arrange(sd) %>% 
  mutate_if(is.numeric, convert_dec2pct) %>% 
  create_kable()
league mean sd
nba 34.23 9.39
nfl 36.98 12.48
mls 48.81 14.13
epl 49.57 23.89
nhl 42.10 28.15
mlb 56.64 29.93
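
To see why a small standard deviation points to gray, here is a minimal sketch that compares two colors from the first table in this post. The helper channel_sd() is hypothetical and only mirrors the per-color step of the calculation.

```r
# Standard deviation across the three RGB channels (rescaled to 0-1).
# `channel_sd()` is a hypothetical helper, not part of the pipeline above.
channel_sd <- function(hex) {
  rgb <- grDevices::col2rgb(hex) / 255
  apply(rgb, 2, stats::sd)
}

channel_sd("#a5acaf")  # a gray: channels 165, 172, 175 are nearly equal (about 0.02)
channel_sd("#ffb612")  # a saturated yellow-orange: channels far apart (about 0.48)
```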

Common Colors

Using a slightly customized version of the plotrix::color.id() function, I can attempt to identify common colors (by name) from the hex values.

# Reference: plotrix::color.id
color_id <- function(hex, set = grDevices::colors()) {
  c2 <- grDevices::col2rgb(hex)
  coltab <- grDevices::col2rgb(set)
  cdist <- apply(coltab, 2, function(z) sum((z - c2)^2))
  set[which(cdist == min(cdist))]
}

identify_color_name <- function(col = NULL, set = grDevices::colors()) {
  col %>%
    # purrr::map(plotrix::color.id) %>% 
    purrr::map(~color_id(.x, set)) %>% 
    purrr::map_chr(~.[1]) %>% 
    stringr::str_replace_all("[0-9]", "")
}

I’ll bin the possible colors into a predefined set. (If a binning strategy is not implemented, one ends up with a more sparse, less meaningful grouping of colors.) This set consists of the “rainbow” colors, as well as black, white, and two shades of grey.

colors_rnbw_hex <-
  c(
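    # Keep only "#RRGGBB". Depending on the R version, rainbow() may or may not
    # append an alpha suffix; stripping a trailing "FF" would truncate colors
    # such as "#0000FF" when no suffix is present.
    substr(grDevices::rainbow(16), 1, 7),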
    "#FFFFFF",
    "#EEEEEE",
    "#AAAAAA",
    "#000000"
  )
colors_rnbw <- identify_color_name(colors_rnbw_hex)
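
The nearest-color matching behind this binning can be illustrated on a much smaller scale. The sketch below uses a hypothetical helper, nearest_color(), and a hypothetical four-color palette rather than the set defined above; it applies the same squared Euclidean distance in RGB space as color_id().

```r
# Nearest named color by squared Euclidean distance in RGB space.
# `nearest_color()` and `palette_names` are hypothetical, for illustration only.
nearest_color <- function(hex, palette_names) {
  target <- as.vector(grDevices::col2rgb(hex))
  candidates <- grDevices::col2rgb(palette_names)  # 3 x n matrix, one column per color
  dists <- colSums((candidates - target)^2)
  palette_names[which.min(dists)]
}

palette_names <- c("red", "blue", "black", "white")
nearest_color("#e62333", palette_names)  # "red"
nearest_color("#13274f", palette_names)  # "black": this dark navy is closer to black than to blue
```

The second call hints at a side effect of this kind of binning: dark shades of any hue can end up nearest to black, which may inflate its count.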

Now, with the setup out of the way, I can easily compute the names of each color and identify the most common colors overall, as well as the most common primary and secondary colors.

add_color_nm_col <- function(data, rename = TRUE) {
  out <-
    data %>%
    pull(hex) %>%
    identify_color_name(set = colors_rnbw) %>% 
    # Calling as_tibble() on a bare vector is discouraged; build the column explicitly.
    tibble::tibble(value = .) %>% 
    bind_cols(data, .)
  
  if(rename) {
    out <-
      out %>% 
      rename(color_nm = value)
  }
  out
}

colors_named <-
  colors_tidy_nona %>%
  add_color_nm_col()

colors_named %>%
  count(color_nm, sort = TRUE) %>% 
  create_kable()
color_nm n
black 173
red 61
darkgray 55
yellow 43
gray 36
darkgoldenrod 35
orangered 33
blue 26
deepskyblue 18
mediumspringgreen 4
# of total rows: 15

colors_named %>%
  count(ord, color_nm, sort = TRUE) %>% 
  filter(ord %in% c("primary", "secondary")) %>% 
  group_by(ord) %>% 
  mutate(rank_byord = row_number(desc(n))) %>% 
  do(head(., 5)) %>% 
  create_kable()
ord color_nm n rank_byord
primary black 75 1
primary red 26 2
primary blue 16 3
primary orangered 14 4
primary deepskyblue 6 5
secondary black 46 1
secondary red 19 2
secondary yellow 19 3
secondary darkgray 17 4
secondary gray 12 5

Of course, a visualization is always appreciated.

ords <- ord_nums %>% pull(ord)
color_nm_na <- "none"
colors_named_compl <-
  colors_named %>% 
  mutate(ord = factor(ord, levels = ords)) %>% 
  select(-hex, -league) %>% 
  tidyr::complete(name, ord, fill = list(color_nm = color_nm_na)) %>%
  tidyr::spread(ord, color_nm)

colors_named_compl_ord2 <- 
  colors_named_compl %>%
  filter(primary != secondary) %>% 
  count(primary, secondary, sort = TRUE) %>%
  filter(primary != "none") %>% 
  filter(secondary != "none")

colors_named_compl_ord2_ig <- 
  colors_named_compl_ord2 %>% 
  igraph::graph_from_data_frame()

igraph::V(colors_named_compl_ord2_ig)$node_label <- names(igraph::V(colors_named_compl_ord2_ig))
igraph::V(colors_named_compl_ord2_ig)$node_size <- igraph::degree(colors_named_compl_ord2_ig)

lab_title_colors_named <- paste0("Colors", lab_base_suffix)
lab_subtitle_colors_named <- paste0("Relationships Among Primary and Secondary Colors")

# Reference: https://rud.is/books/21-recipes/visualizing-a-graph-of-retweet-relationships.html.
viz_colors_named <-
  colors_named_compl_ord2_ig %>%
  ggraph::ggraph(layout = "linear", circular = TRUE) +
  # ggraph::ggraph(layout = "kk") +
  # ggraph::ggraph() +
  # ggraph::geom_edge_arc() +
  ggraph::geom_edge_arc(
    aes(
      edge_width = n / 3,
      # edge_alpha = n
      start_cap = ggraph::label_rect(node1.name, padding = margin(5, 5, 5, 5)),
      end_cap = ggraph::label_rect(node2.name, padding = margin(5, 5, 5, 5))
      )
    ) +
  ggraph::geom_node_text(aes(label = node_label)) +
  coord_fixed() +
  # teplot::theme_te()
  # ggraph::theme_graph(base_family = "Arial") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 16), plot.subtitle = element_text(size = 12)) +
  labs(title = lab_title_colors_named, subtitle = lab_subtitle_colors_named)
# viz_colors_named

Additionally, given the “set” nature of the data set, I think that the {UpSetR} package can be used to create an intersection-style graph. [6]
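
For readers who want to try this, here is a minimal sketch of the input shape that UpSetR::upset() works with: one row per team and one 0/1 column per color name. The toy data frame is hypothetical, and the commented-out call is my assumption about a reasonable invocation, not necessarily the one behind the plot discussed here.

```r
# Hypothetical long-format input: one row per (team, color name) pair.
toy_named <- data.frame(
  name = c("Team A", "Team A", "Team B", "Team B", "Team C"),
  color_nm = c("black", "red", "black", "yellow", "red"),
  stringsAsFactors = FALSE
)

# Cross-tabulate teams against color names, then reduce counts to 0/1 flags.
membership <- as.data.frame.matrix(table(toy_named$name, toy_named$color_nm))
membership[] <- lapply(membership, function(x) as.integer(x > 0))
membership

# Assumed usage (requires the UpSetR package):
# UpSetR::upset(membership, sets = names(membership), main.bar.color = "red")
```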

Neglecting the color black, which is unsurprisingly the most common color, red has the highest count. (Consequently, it is deserving of use as the fill for the bars in the following plot.) [7] On the other hand, it’s a bit surprising to me that neither blue nor its brethren, cyan and deepskyblue, is among the top two or three. One might argue that having three shades of blue inherently causes the classification to be “less focused”, but this does not seem to curb the prominence of red, which also has two sister colors in orangered and darkpink.

Color Clustering

Aside from set analysis, this data set seems prime for some unsupervised learning, and, more specifically, clustering. While analysis of the colors using RGB values as features can be done (and is actually what I tried initially), the results are not as interpretable as I would like them to be due to the “>2-dimensionality” nature of such an approach.

Thus, as an alternative to RGB components, I determined that “hue” serves as a reasonable all-in-one measure of the “essence” of a color. It is inherently a radial feature: its value can range from 0 to 360 (where red is 0, green is 120, and blue is 240). [8]
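
As a quick illustration of the hue scale, here is a minimal sketch that derives hue from a hex value with base R’s HSV conversion. The helper name hex_to_hue() is hypothetical, and this is one possible way to compute hue, not necessarily the one used for the clustering.

```r
# Hue in degrees from a hex color, via base R's HSV conversion.
# `hex_to_hue()` is a hypothetical helper name.
hex_to_hue <- function(hex) {
  hsv <- grDevices::rgb2hsv(grDevices::col2rgb(hex))  # rows h, s, v, each in [0, 1]
  unname(hsv["h", ]) * 360
}

hex_to_hue(c("#ff0000", "#00ff00", "#0000ff"))  # 0, 120, 240
```

One caveat with this conversion: black, white, and grays have no meaningful hue and come back as 0, the same value as red.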

Then, by limiting the color sets to just the primary and secondary colors (such that there are only 2 features), I create a setting that allows the clustering results to be interpreted (and visualized) in a relatively direct manner.

With my setup decided upon, I implement a tidy pipeline for statistical analysis, making heavy use of David Robinson’s {broom} package, to explore various values of k for a kmeans model. (The package’s kmeans vignette provides a really helpful example.)
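
The core of such a pipeline can be sketched in base R. The example below uses randomly generated hues as a hypothetical stand-in for the real primary and secondary hue features, and it collects the one model-level statistic that matters for choosing k. It is a simplified illustration, not the {broom}-based code behind the results in this post.

```r
set.seed(42)

# Hypothetical stand-in for the real features: one row per team, with the
# hue (0-360) of its primary and secondary colors.
hues <- data.frame(
  primary = runif(60, 0, 360),
  secondary = runif(60, 0, 360)
)

# Fit one k-means model per candidate k and keep the total within-cluster
# sum of squares, the quantity broom::glance() reports as tot.withinss.
ks <- 1:8
kms_wss <- data.frame(
  k = ks,
  tot.withinss = vapply(
    ks,
    function(k) stats::kmeans(hues, centers = k, nstart = 25)$tot.withinss,
    numeric(1)
  )
)
kms_wss
```

Setting nstart to 25 reruns each fit from several random starts and keeps the best one, which makes the resulting curve less sensitive to unlucky initializations.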

While this visualization is fairly informative, it doesn’t quite pinpoint which value of k is “most optimal”. There are various methods for determining the optimal k-value for a kmeans model (http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/), one of which is the “Elbow” method. Basically, the point is to plot the within-cluster sum of squares (WSS) (i.e. variance) for each value of k, which typically decreases monotonically as k increases, and pick the value of k that corresponds to the “bend” in the plot.

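For those who enjoy calculus, the k value for which the second derivative of the curve is minimized is taken here as the optimal value (by the Elbow method). The code below approximates that derivative with second-order differences of the first eight WSS values; the + 2 maps the position in the twice-differenced vector back to a value of k.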

kms_metrics$tot.withinss[1:8] %>% 
  diff(differences = 2) %>% 
  which.min() + 
  2

## [1] 7

To complement this conclusion, I can use the fviz_nbclust() function from the factoextra package. It deduces the optimal k value by the consensus of different methods.

It’s nice to see that this method comes to the same conclusion.

I’ll continue this analysis in a separate write-up. Unfortunately (or, perhaps, fortunately) there was too much to fit in a single document without it feeling overwhelming.



  1. The fact that the data comes in an easy-to-work-with format comes as a relief to those of us used to having to clean raw data tediously.
  2. Note that I use the name ord to represent “ordinality” of the color: that is, primary, secondary, tertiary, or quaternary.
  3. In fact, the current consensus among sports fans is that the MLB has a decaying fan-base in the U.S. because it is failing to attract younger fans. This opinion is typically based on conjectures about the game’s slow pace, but, who knows, maybe colors also have something to do with it! (I’m only kidding. I pride myself on guarding against the correlation-equals-causation fallacy.)
  4. Again, in case you think I’m serious, let me be clear: yes, it is most likely a coincidence.
  5. This probably should be a footnote :).
  6. After learning about this package recently, I’m glad to finally have a use-case for it!
  7. Unfortunately, it seems that customizing the colors for each set is not straightforward, so I do not attempt it.
  8. Red is also 360.
