Analyzing Professional Sports Team Colors with R

r on Tony ElHabr

4 years ago

[This article was first published on r on Tony ElHabr, and kindly contributed to R-bloggers.]


When working with the ggplot2 package, I often find myself playing around with colors for longer than I probably should. I think that this is because I know that the right color scheme can greatly enhance the information that a plot portrays; and, conversely, a poorly matched palette can suppress the message of an otherwise good visualization.

With that said, I wanted to take a look at the presence of colors in the sports realm. I think some fun insight can be had from an exploration of colors used by individual sports teams. Some people have done some interesting technical research on this topic, such as studying the possible effects of color on fan and player perception of teams.

Setup

Technical Notes

I show code only where I believe it complements the commentary throughout; otherwise, it is hidden. Nonetheless, the underlying code can be viewed in the raw .Rmd file for this write-up.
Although I list all of the packages used in this write-up (for the sake of reproducibility), I comment out those that are used only in an explicit manner (i.e. via the “package::function” syntax). Only dplyr and ggplot2 are attached in full. Minimizing the namespace in this manner is a personal convention.

library("dplyr")
# library("teamcolors")
library("ggplot2")
# library("tidyr")
# library("tibble")
# library("purrr")
# library("stringr")
# library("stringi")
# library("nbastatR")
# library("UpSetR")
# library("factoextra")
# library("NbClust")
# library("corrr")
# library("viridis")
# library("igraph")
# library("ggraph")
# library("circlize")
# library("colorscience")

The data that I’ll use comes from the teamcolors R package, which itself is sourced from Jim Nielsen’s website for team colors. This data set provides color information for all teams from six professional sports leagues:

EPL (English soccer),
MLB (baseball),
MLS (American soccer),
NBA (basketball),
NFL (American football), and
NHL (hockey).

teamcolors::teamcolors %>% create_kable()

name	league	primary	secondary	tertiary	quaternary
AFC Bournemouth	epl	#e62333	#000000	NA	NA
Anaheim Ducks	nhl	#010101	#a2aaad	#fc4c02	#85714d
Arizona Cardinals	nfl	#97233f	#000000	#ffb612	#a5acaf
Arizona Coyotes	nhl	#010101	#862633	#ddcba4	NA
Arizona Diamondbacks	mlb	#a71930	#000000	#e3d4ad	NA
Arsenal	epl	#ef0107	#023474	#9c824a	NA
Atlanta Braves	mlb	#ce1141	#13274f	NA	NA
Atlanta Falcons	nfl	#a71930	#000000	#a5acaf	#a30d2d
Atlanta Hawks	nba	#e13a3e	#c4d600	#061922	NA
Atlanta United FC	mls	#a29061	#80000b	#000000	NA
¹ # of total rows: 165

Putting this data in a “tidy” format is rather straightforward. ¹ ²

colors_tidy <-
  teamcolors::teamcolors %>%
  tidyr::gather(ord, hex, -name, -league)
colors_tidy %>% create_kable()

name	league	ord	hex
AFC Bournemouth	epl	primary	#e62333
Anaheim Ducks	nhl	primary	#010101
Arizona Cardinals	nfl	primary	#97233f
Arizona Coyotes	nhl	primary	#010101
Arizona Diamondbacks	mlb	primary	#a71930
Arsenal	epl	primary	#ef0107
Atlanta Braves	mlb	primary	#ce1141
Atlanta Falcons	nfl	primary	#a71930
Atlanta Hawks	nba	primary	#e13a3e
Atlanta United FC	mls	primary	#a29061
¹ # of total rows: 660

Exploration

To begin, here’s a visualization of all the colors in this data set. Not much significance can be extracted from this plot, but it’s still nice to have as a mechanism for getting familiar with the data.

Color Brightness

Note that there are quite a few teams without a full set of four colors (and some without a third or even second color).

colors_pct_nas <-
  colors_tidy %>%
  count(league, is_na = is.na(hex), sort = TRUE) %>%
  filter(is_na) %>%
  select(-is_na) %>%
  left_join(
    teamcolors::teamcolors %>%
      count(league, sort = TRUE) %>%
      rename(total = n) %>%
      mutate(total = as.integer(4 * total)),
    by = "league"
  ) %>%
  mutate(n_pct = 100 * n / total) %>% 
  mutate_if(is.numeric, funs(round(., 2)))
colors_pct_nas %>% create_kable()

league	n	total	n_pct
mlb	47	120	39.17
nhl	42	124	33.87
epl	34	80	42.50
mls	20	88	22.73
nba	19	120	15.83
nfl	2	128	1.56

Both the visualization and the tabulation indicate that the MLB is missing the most colors in absolute terms (47 values), although the EPL has a slightly higher share of missing values (42.5% versus 39.2%). Perhaps this suggests that the MLB is the most “dull” sports league. ³ The NFL is on the other end of the spectrum (pun intended), with only about 1.6% of its color values missing. Is it a coincidence that the NFL is the most popular sports league in the U.S.? ⁴
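As a quick sanity check on the percentages in the table, here is a minimal base-R sketch that re-derives them from the counts shown above. The data frame name `missing` and the `missing_by_share` ordering are only for illustration; they are not part of the original code.

```r
# Counts copied from the table above: missing hex values (n) and
# total color slots (4 per team) for each league.
missing <- data.frame(
  league = c("mlb", "nhl", "epl", "mls", "nba", "nfl"),
  n      = c(47, 42, 34, 20, 19, 2),
  total  = c(120, 124, 80, 88, 120, 128)
)

# Share of missing slots, as a percentage rounded to 2 decimals.
missing$n_pct <- round(100 * missing$n / missing$total, 2)

# Sort by share of missing values rather than by raw count.
missing_by_share <- missing[order(-missing$n_pct), ]
print(missing_by_share)
```

Sorting by `n_pct` instead of `n` is just one way of reading the table "on a per-team basis", since the leagues do not all have the same number of teams.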

My subjective indictment of MLB as dull is certainly unfair and unquantitative. Does “dull” refer to hue, lightness, brightness, etc.? For the sake of argument, let’s say that I want to interpret dullness as “brightness”, which, in the color lexicon, is interpreted as the arithmetic mean of the red-green-blue (RGB) values of a color. To rank the leagues by brightness, I can take the average of the RGB values (derived from the hex values) across all colors for all teams in each league. The resulting values–where a lower value indicates a darker color, and a higher value indicates a brighter color–provide a fair measure upon which each league’s aggregate color choices can be judged.
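To make the definition concrete, here is a minimal sketch of the brightness calculation for individual colors, using only base R. The helper name `brightness()` is hypothetical and is not part of the original code.

```r
# Brightness as described above: the mean of the red, green and blue
# channels, rescaled from the 0-255 range to 0-1.
# `brightness()` is a hypothetical helper name used only for this sketch.
brightness <- function(hex) {
  rgb_mat <- grDevices::col2rgb(hex)  # 3 x n matrix: rows red, green, blue
  colMeans(rgb_mat) / 255
}

# Black, white and pure red as reference points.
b <- brightness(c("#000000", "#FFFFFF", "#FF0000"))
print(b)
```

Black and white sit at the two ends of the scale, and a fully saturated primary such as pure red lands at one third because only one of its three channels is "on".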

But first, the reader should be aware of a few more technicalities: ⁵

I put this computation in a function because I perform the same actions multiple times. This practice complies with the DRY principle.
I was unable to get grDevices::col2rgb() (and some other custom functions used elsewhere) to work in a vectorized manner, so I created a function (add_rgb_cols()) to do so. I believe the problem is that grDevices::col2rgb() returns a matrix instead of a single value.
Additionally, despite only using one element in the returned list here, I wrote the function to return a list of results because I was inspecting the different sets of results during code development.
Finally, I re-scale each RGB value to a value between 0 and 1–RGB is typically expressed on a 0 to 255 scale–in order to make the final values more interpretable.
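The matrix shape mentioned in the second point can be checked directly. The following is a small base-R sketch using the first three primary colors from the table above; the names `hex`, `rgb_mat`, and `rgb_df` are only illustrative.

```r
# First three primary colors from the table above.
hex <- c("#e62333", "#010101", "#97233f")

# col2rgb() returns one column per color and one row per channel.
rgb_mat <- grDevices::col2rgb(hex)
print(dim(rgb_mat))

# Transposing gives one row per color, which lines up with the rows of a
# data frame that holds the hex values.
rgb_df <- as.data.frame(t(rgb_mat))
rgb_df$hex <- hex
print(rgb_df)
```

This transpose-then-bind step is the same idea that add_rgb_cols() applies to the full tidy data set.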

add_rgb_cols <- function(data) {
  data %>%
    pull(hex) %>%
    grDevices::col2rgb() %>%
    t() %>%
    tibble::as_tibble() %>%
    bind_cols(data, .) 
}

rank_leagues_byrgb <- function(data = NULL) {
  colors_rgb <-
    data %>%
    add_rgb_cols() %>%
    select(-hex) %>%
    arrange(league, name)
  
  colors_rgb_bynm_bylg <-
    colors_rgb %>%
    mutate_at(vars(red, green, blue), funs(. / 255)) %>%
    group_by(name, league) %>%
    summarize_at(vars(red, green, blue), funs(mean)) %>%
    ungroup() %>%
    tidyr::gather(rgb, value, red, green, blue) %>%
    group_by(name, league) %>%
    summarize_at(vars(value), funs(mean, sd)) %>%
    ungroup() %>%
    arrange(league, mean)
  
  colors_rgb_bylg <-
    colors_rgb_bynm_bylg %>%
    group_by(league) %>%
    summarize_at(vars(mean, sd), funs(mean)) %>%
    ungroup() %>%
    arrange(mean)
  colors_rgb_bylg
}

convert_dec2pct <- function(x) {
  100 * round(x, 4)
}

colors_tidy_nona <-
  colors_tidy %>%
  filter(!is.na(hex))


colors_tidy_nona %>% 
  rank_leagues_byrgb() %>%
  arrange(mean) %>% 
  mutate_if(is.numeric, funs(convert_dec2pct)) %>% 
  create_kable()

league	mean	sd
nhl	30.46	14.52
nfl	32.90	12.68
mlb	33.75	13.16
epl	36.56	15.79
mls	38.59	12.05
nba	40.99	10.37

This calculation confirms what we might have guessed by inspection–the NHL has the darkest colors. In fact, it seems that the NHL’s “darkness” is most prominent in the primary colors of the teams in the league.

colors_tidy_nona %>% 
  filter(ord == "primary") %>% 
  rank_leagues_byrgb() %>% 
  arrange(mean) %>% 
  mutate_if(is.numeric, funs(convert_dec2pct)) %>% 
  create_kable()

league	mean	sd
nhl	9.13	9.20
nfl	23.42	18.69
mlb	29.90	29.00
mls	32.16	22.74
epl	37.35	29.10
nba	37.95	30.10

On the other hand, the NBA and the two soccer leagues (the MLS and the EPL) stand out as the leagues with the “brightest” colors.

Finally, just by inspection, it seems like there is an unusual pattern where a disproportionate number of teams in the MLS, NBA, and NFL have shades of gray as their tertiary colors. Using the same function as before, it can be shown indirectly via relatively small standard deviation values that there is not much variation in this color.

colors_tidy_nona %>% 
  filter(ord == "tertiary") %>% 
  rank_leagues_byrgb() %>%
  arrange(sd) %>% 
  mutate_if(is.numeric, funs(convert_dec2pct)) %>% 
  create_kable()

league	mean	sd
nba	34.23	9.39
nfl	36.98	12.48
mls	48.81	14.13
epl	49.57	23.89
nhl	42.10	28.15
mlb	56.64	29.93
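To see why a small standard deviation across the red, green, and blue channels points toward gray, here is a minimal base-R sketch. The helper name `channel_sd()` is hypothetical; the first two hex values are tertiary colors from the first table (Atlanta Falcons and Arizona Cardinals), and the third is a pure mid-gray added for reference.

```r
# Standard deviation across the three channels, on a 0-1 scale.
# `channel_sd()` is a hypothetical helper name used only for this sketch.
channel_sd <- function(hex) {
  rgb_mat <- grDevices::col2rgb(hex) / 255
  apply(rgb_mat, 2, stats::sd)
}

# A grayish tertiary color, a saturated one, and a pure gray for reference.
sds <- channel_sd(c("#a5acaf", "#ffb612", "#808080"))
print(sds)
```

A pure gray has identical channels and therefore zero spread, so the closer a league's average spread is to zero, the grayer its colors tend to be.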

Common Colors

Using a slightly customized version of the plotrix::color.id() function, I can attempt to identify common colors (by name) from the hex values.

# Reference: plotrix::color.id
color_id <- function(hex, set = grDevices::colors()) {
  c2 <- grDevices::col2rgb(hex)
  coltab <- grDevices::col2rgb(set)
  cdist <- apply(coltab, 2, function(z) sum((z - c2)^2))
  set[which(cdist == min(cdist))]
}

identify_color_name <- function(col = NULL, set = grDevices::colors()) {
  col %>%
    # purrr::map(plotrix::color.id) %>% 
    purrr::map(~color_id(.x, set)) %>% 
    purrr::map_chr(~.[1]) %>% 
    stringr::str_replace_all("[0-9]", "")
}
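The matching rule in color_id() is a nearest-neighbor lookup in RGB space. Here is a stripped-down base-R sketch of that idea on a deliberately tiny candidate set; the function name `nearest_color()` and the four-color default set are assumptions made for illustration only.

```r
# Nearest color (by squared Euclidean distance in RGB space) from a small
# candidate set. `nearest_color()` is a hypothetical name for this sketch.
nearest_color <- function(hex, set = c("red", "blue", "black", "gray")) {
  target <- grDevices::col2rgb(hex)[, 1]      # length-3 vector for one color
  candidates <- grDevices::col2rgb(set)       # 3 x length(set) matrix
  # The length-3 target is recycled down each column of the matrix.
  dists <- colSums((candidates - target)^2)
  set[which.min(dists)]                       # first match wins on ties
}

# Three hex values taken from the first table.
matches <- sapply(c("#e62333", "#010101", "#a5acaf"), nearest_color)
print(matches)
```

Shrinking the candidate set is what makes the later binning step work: every hex value is forced onto one of a handful of names.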

I’ll bin the possible colors into a predefined set. (If a binning strategy is not implemented, one ends up with a more sparse, less meaningful grouping of colors.) This set consists of the “rainbow” colors, as well as black, white, and two shades of grey.

colors_rnbw_hex <-
  c(
    # Keep only "#RRGGBB"; this works whether or not rainbow() appends an alpha suffix.
    substr(grDevices::rainbow(16), 1, 7),
    "#FFFFFF",
    "#EEEEEE",
    "#AAAAAA",
    "#000000"
  )
colors_rnbw <- identify_color_name(colors_rnbw_hex)

Now, with the setup out of the way, I can easily compute the names of each color and identify the most common colors overall, as well as the most common primary and secondary colors.

add_color_nm_col <- function(data, rename = TRUE) {
  out <-
    data %>%
    pull(hex) %>%
    identify_color_name(set = colors_rnbw) %>% 
    tibble::as_tibble() %>% 
    bind_cols(data, .)
  
  if(rename) {
    out <-
      out %>% 
      rename(color_nm = value)
  }
  out
}

colors_named <-
  colors_tidy_nona %>%
  add_color_nm_col()

colors_named %>%
  count(color_nm, sort = TRUE) %>% 
  create_kable()

color_nm	n
black	173
red	61
darkgray	55
yellow	43
gray	36
darkgoldenrod	35
orangered	33
blue	26
deepskyblue	18
mediumspringgreen	4
¹ # of total rows: 15

colors_named %>%
  count(ord, color_nm, sort = TRUE) %>% 
  filter(ord %in% c("primary", "secondary")) %>% 
  group_by(ord) %>% 
  mutate(rank_byord = row_number(desc(n))) %>% 
  do(head(., 5)) %>% 
  create_kable()

ord	color_nm	n	rank_byord
primary	black	75	1
primary	red	26	2
primary	blue	16	3
primary	orangered	14	4
primary	deepskyblue	6	5
secondary	black	46	1
secondary	red	19	2
secondary	yellow	19	3
secondary	darkgray	17	4
secondary	gray	12	5

Of course, a visualization is always appreciated.

ords <- ord_nums %>% pull(ord)
color_nm_na <- "none"
colors_named_compl <-
  colors_named %>% 
  mutate(ord = factor(ord, levels = ords)) %>% 
  select(-hex, -league) %>% 
  tidyr::complete(name, ord, fill = list(color_nm = color_nm_na)) %>%
  tidyr::spread(ord, color_nm)

colors_named_compl_ord2 <- 
  colors_named_compl %>%
  filter(primary != secondary) %>% 
  count(primary, secondary, sort = TRUE) %>%
  filter(primary != "none") %>% 
  filter(secondary != "none")

colors_named_compl_ord2_ig <- 
  colors_named_compl_ord2 %>% 
  igraph::graph_from_data_frame()

igraph::V(colors_named_compl_ord2_ig)$node_label <- names(igraph::V(colors_named_compl_ord2_ig))
igraph::V(colors_named_compl_ord2_ig)$node_size <- igraph::degree(colors_named_compl_ord2_ig)

lab_title_colors_named <- paste0("Colors", lab_base_suffix)
lab_subtitle_colors_named <- paste0("Relationships Among Primary and Secondary Colors")

# Reference: https://rud.is/books/21-recipes/visualizing-a-graph-of-retweet-relationships.html.
viz_colors_named <-
  colors_named_compl_ord2_ig %>%
  ggraph::ggraph(layout = "linear", circular = TRUE) +
  # ggraph::ggraph(layout = "kk") +
  # ggraph::ggraph() +
  # ggraph::geom_edge_arc() +
  ggraph::geom_edge_arc(
    aes(
      edge_width = n / 3,
      # edge_alpha = n
      start_cap = ggraph::label_rect(node1.name, padding = margin(5, 5, 5, 5)),
      end_cap = ggraph::label_rect(node2.name, padding = margin(5, 5, 5, 5))
      )
    ) +
  ggraph::geom_node_text(aes(label = node_label)) +
  coord_fixed() +
  # teplot::theme_te()
  # ggraph::theme_graph(base_family = "Arial") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 16), plot.subtitle = element_text(size = 12)) +
  labs(title = lab_title_colors_named, subtitle = lab_subtitle_colors_named)
# viz_colors_named

Additionally, given the “set” nature of the data set, I think that the {UpSetR} package can be used to create an intersection-style graph. ⁶

Neglecting the color black, which is unsurprisingly the most common color, red has the highest count. (Consequently, it is deserving of use as the fill for the bars in the following plot.) ⁷ On the other hand, it’s a bit surprising to me that neither blue nor its brethren, cyan and deepskyblue, is among the top 2 or 3. One might argue that the three shades of blue inherently cause classification to be “less focused”, but this does not seem to curb the prominence of red, which also has two sister colors in orangered and darkpink.

Color Clustering

Aside from set analysis, this data set seems prime for some unsupervised learning, and, more specifically, clustering. While analysis of the colors using RGB values as features can be done (and is actually what I tried initially), the results are not as interpretable as I would like them to be due to the “>2-dimensionality” nature of such an approach.

Thus, as an alternative to RGB components, I determined that “hue” serves as a reasonable all-in-one measure of the “essence” of a color. It is inherently a radial feature–its value can range from 0 to 360 (where red is 0, green is 120, and blue is 240). ⁸
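For reference, hue can be derived from a hex value with base R alone. This is a minimal sketch, not necessarily how the values were computed for this write-up; the helper name `hue_deg()` is hypothetical.

```r
# Hue in degrees from a hex color, via base R's RGB -> HSV conversion.
# `hue_deg()` is a hypothetical helper name used only for this sketch.
hue_deg <- function(hex) {
  hsv_mat <- grDevices::rgb2hsv(grDevices::col2rgb(hex))  # rows: h, s, v
  unname(hsv_mat["h", ]) * 360                            # h is on a 0-1 scale
}

# Pure red, green and blue as reference points.
h <- hue_deg(c("#FF0000", "#00FF00", "#0000FF"))
print(h)
```

One caveat worth keeping in mind: black, white, and grays have no meaningful hue, so a hue-only feature cannot tell them apart from one another.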

Then, by limiting the color sets to just the primary and secondary colors (such that there are only 2 features), I create a setting that allows the clustering results to be interpreted (and visualized) in a relatively direct manner.

With my setup decided upon, I implement a tidy pipeline for statistical analysis–making heavy use of David Robinson’s {broom} package–to explore various values of k for a kmeans model. (The package’s kmeans vignette provides a really helpful example.)
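The pipeline itself is hidden, but the core loop can be sketched in base R. Everything below is an assumption for illustration: the `hues` data frame is randomly generated stand-in data (not the team colors), and plain sapply() is used in place of the {broom}-based approach.

```r
set.seed(42)

# Made-up stand-in for the real features: primary and secondary hue
# (0-360) for 60 hypothetical teams.
hues <- data.frame(
  primary   = runif(60, 0, 360),
  secondary = runif(60, 0, 360)
)

ks <- 1:8

# Fit one k-means model per k and keep the total within-cluster sum of squares.
wss <- sapply(ks, function(k) {
  stats::kmeans(hues, centers = k, nstart = 20)$tot.withinss
})

kms_sketch <- data.frame(k = ks, tot.withinss = wss)
print(kms_sketch)
```

With {broom}, the same quantity is available as the tot.withinss column returned by glance() on each fitted model, which is convenient when the fits are stored in a list-column.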

While this visualization is fairly informative, it doesn’t quite pinpoint exactly which value of k is optimal. There are [various methods for determining the optimal k-value for a kmeans model](http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/), one of which is the “Elbow” method. Basically, the point is to plot the within-cluster sum of squares (WSS) (i.e. variance) for each value of k–which is typically monotonically decreasing with increasing k–and pick the value of k that corresponds to the “bend” in the plot.

For those who enjoy calculus, the k value for which the second derivative of the curve is minimized is the optimal value (by the Elbow method).

kms_metrics$tot.withinss[1:8] %>% 
  diff(differences = 2) %>% 
  which.min() + 
  2

## [1] 7

To complement this conclusion, I can use the fviz_nbclust() function from the factoextra package. It deduces the optimal k value by the consensus of different methods.

It’s nice to see that this method comes to the same conclusion.

I’ll continue this analysis in a separate write-up. Unfortunately (or, perhaps, fortunately) there was too much to fit in a single document without it feeling overwhelming.

¹ The fact that the data comes in an easy-to-work-with format comes as a relief to those of us used to having to clean raw data tediously.
² Note that I use the name ord to represent the “ordinality” of the color–that is, primary, secondary, tertiary, or quaternary.
³ In fact, the current consensus among sports fans is that the MLB has a decaying fan-base in the U.S. because it is failing to attract younger fans. This opinion is typically based on conjectures about the game’s slow pace, but, who knows, maybe color also has something to do with it! (I’m only kidding. I pride myself on guarding against the correlation-equals-causation fallacy.)
⁴ Again, in case you think I’m serious, let me be clear–yes, it is most likely a coincidence.
⁵ This probably should be a footnote :).
⁶ After learning about this package recently, I’m glad to finally have a use case for it!
⁷ Unfortunately, it seems that customizing the colors for each set is not straightforward, so I do not attempt it.
⁸ Red is also 360.
