When working with the ggplot2 package, I often find myself playing around with colors for longer than I probably should be. I think that this is because I know that the right color scheme can greatly enhance the information that a plot portrays; and, conversely, choosing an uncomplimentary palette can suppress the message of an otherwise good visualization.
With that said, I wanted to take a look at the presence of colors in the sports realm. I think some fun insight can be had from an exploration of colors used by individual sports teams. Some people have done some interesting technical research on this topic, such as studying the possible effects of color on fan and player perception of teams.
Setup
Technical Notes
I show code only where I believe it complements the commentary throughout; otherwise, it is hidden. Nonetheless, the underlying code can be viewed in the raw .Rmd file for this write-up.
Although I list all of the packages used in this write-up (for the sake of reproducibility), I comment out those that are used only in an explicit manner (i.e. via the “package::function” syntax). (Only dplyr and ggplot2 are imported altogether.) Minimizing the namespace in this manner is a personal convention.
library("dplyr")
# library("teamcolors")
library("ggplot2")
# library("tidyr")
# library("tibble")
# library("purrr")
# library("stringr")
# library("stringi")
# library("nbastatR")
# library("UpSetR")
# library("factoextra")
# library("NbClust")
# library("corrr")
# library("viridis")
# library("igraph")
# library("ggraph")
# library("circlize")
# library("colorscience")
The data that I’ll use comes from the teamcolors R package, which itself is sourced from Jim Nielsen’s website for team colors. This data set provides color information for all teams from six professional sports leagues:
- EPL (European futbol),
- MLB (baseball),
- MLS (American soccer),
- NBA (basketball),
- NFL (American football), and
- NHL (hockey).
teamcolors::teamcolors %>% create_kable()
name | league | primary | secondary | tertiary | quaternary |
---|---|---|---|---|---|
AFC Bournemouth | epl | #e62333 | #000000 | NA | NA |
Anaheim Ducks | nhl | #010101 | #a2aaad | #fc4c02 | #85714d |
Arizona Cardinals | nfl | #97233f | #000000 | #ffb612 | #a5acaf |
Arizona Coyotes | nhl | #010101 | #862633 | #ddcba4 | NA |
Arizona Diamondbacks | mlb | #a71930 | #000000 | #e3d4ad | NA |
Arsenal | epl | #ef0107 | #023474 | #9c824a | NA |
Atlanta Braves | mlb | #ce1141 | #13274f | NA | NA |
Atlanta Falcons | nfl | #a71930 | #000000 | #a5acaf | #a30d2d |
Atlanta Hawks | nba | #e13a3e | #c4d600 | #061922 | NA |
Atlanta United FC | mls | #a29061 | #80000b | #000000 | NA |

(Total rows: 165)
Putting this data in a “tidy” format is rather straightforward. 1 2
colors_tidy <- teamcolors::teamcolors %>%
  tidyr::gather(ord, hex, -name, -league)
colors_tidy %>% create_kable()
name | league | ord | hex |
---|---|---|---|
AFC Bournemouth | epl | primary | #e62333 |
Anaheim Ducks | nhl | primary | #010101 |
Arizona Cardinals | nfl | primary | #97233f |
Arizona Coyotes | nhl | primary | #010101 |
Arizona Diamondbacks | mlb | primary | #a71930 |
Arsenal | epl | primary | #ef0107 |
Atlanta Braves | mlb | primary | #ce1141 |
Atlanta Falcons | nfl | primary | #a71930 |
Atlanta Hawks | nba | primary | #e13a3e |
Atlanta United FC | mls | primary | #a29061 |

(Total rows: 660)
Exploration
To begin, here’s a visualization of all the colors in this data set. Not much significance can be extracted from this plot, but it’s still nice to have as a mechanism for getting familiar with the data.
Color Brightness
Note that there are quite a few teams without a full set of four colors (and some without a third or even second color).
colors_pct_nas <- colors_tidy %>%
  count(league, is_na = is.na(hex), sort = TRUE) %>%
  filter(is_na) %>%
  select(-is_na) %>%
  left_join(
    teamcolors::teamcolors %>%
      count(league, sort = TRUE) %>%
      rename(total = n) %>%
      mutate(total = as.integer(4 * total)),
    by = "league"
  ) %>%
  mutate(n_pct = 100 * n / total) %>%
  mutate_if(is.numeric, funs(round(., 2)))
colors_pct_nas %>% create_kable()
league | n | total | n_pct |
---|---|---|---|
mlb | 47 | 120 | 39.17 |
nhl | 42 | 124 | 33.87 |
epl | 34 | 80 | 42.50 |
mls | 20 | 88 | 22.73 |
nba | 19 | 120 | 15.83 |
nfl | 2 | 128 | 1.56 |
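As a rough cross-check of the percentages above, the share of missing values per league can also be computed directly in base R. This is only a sketch on made-up data: the data frame `toy` and its two leagues are hypothetical and stand in for `colors_tidy`.

```r
# Hypothetical toy data standing in for colors_tidy (two leagues, four
# color slots each); the values are invented for illustration.
toy <- data.frame(
  league = rep(c("a", "b"), each = 4),
  hex = c("#000000", NA, NA, "#ffffff", "#ff0000", "#00ff00", "#0000ff", NA),
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the proportion of TRUE values, so this
# gives the percentage of missing hex values within each league.
pct_na <- 100 * tapply(is.na(toy$hex), toy$league, mean)
print(pct_na)
```

Because every team contributes exactly four rows to the tidy data, this proportion should match the `n / total` calculation used for the table.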
Both the visualization and the tabulation indicate that the MLB is missing the most colors in absolute terms (47), although the EPL is missing the largest share on a per-team basis (42.5%, versus 39.2% for the MLB). Perhaps this suggests that the MLB is the most “dull” sports league. 3 The NFL is on the other end of the spectrum (pun intended), with only about 1.6% of its color values missing. Is it a coincidence that the NFL is the most popular sport in the U.S.? 4
My subjective indictment of MLB as dull is certainly unfair and unquantitative. Does “dull” refer to hue, lightness, brightness, etc.? For the sake of argument, let’s say that I want to interpret dullness as “brightness”, which, in the color lexicon, is interpreted as the arithmetic mean of the red-green-blue (RGB) values of a color. To rank the leagues by brightness, I can take the average of the RGB values (derived from the hex values) across all colors for all teams in each league. The resulting values–where a lower value indicates a darker color, and a higher value indicates a brighter color–provide a fair measure upon which each league’s aggregate color choices can be judged.
But first, the reader should be aware of a few more technicalities: 5
- I put this computation in a function because I perform the same actions multiple times. This practice complies with the DRY principle.
- I was unable to get grDevices::col2rgb() (and some other custom functions used elsewhere) to work in a vectorized manner, so I created a function (add_rgb_cols()) to do so. I believe the problem is that grDevices::col2rgb() returns a matrix instead of a single value.
- Additionally, despite only using one element in the returned list here, I wrote the function to return a list of results because I was inspecting the different sets of results during code development.
- Finally, I re-scale each RGB value to a value between 0 and 1–RGB is typically expressed on a 0 to 255 scale–in order to make the final values more interpretable.
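To make the brightness definition and the matrix issue concrete, here is a minimal base R sketch. The helper `hex_brightness()` is hypothetical (it is not part of the code for this write-up); it relies on `grDevices::col2rgb()` accepting a character vector and returning a matrix with one row per channel and one column per color, which is why a transpose is needed before working row-wise.

```r
# Hypothetical helper (not from the original write-up): brightness as the
# mean of the re-scaled red, green, and blue channels of each hex color.
hex_brightness <- function(hex) {
  rgb_mat <- grDevices::col2rgb(hex)  # 3 x n matrix: rows red, green, blue
  rgb_01 <- t(rgb_mat) / 255          # n x 3 matrix, each channel in [0, 1]
  rowMeans(rgb_01)                    # one brightness value per color
}

hexes <- c("#000000", "#ffffff", "#e62333")
brightness <- hex_brightness(hexes)
print(round(brightness, 4))
```

Black and white land at the two ends of the scale, and everything else falls in between, which matches the "lower is darker" reading used for the league rankings.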
add_rgb_cols <- function(data) {
  data %>%
    pull(hex) %>%
    grDevices::col2rgb() %>%
    t() %>%
    tibble::as_tibble() %>%
    bind_cols(data, .)
}

rank_leagues_byrgb <- function(data = NULL) {
  colors_rgb <- data %>%
    add_rgb_cols() %>%
    select(-hex) %>%
    arrange(league, name)

  colors_rgb_bynm_bylg <- colors_rgb %>%
    mutate_at(vars(red, green, blue), funs(. / 255)) %>%
    group_by(name, league) %>%
    summarize_at(vars(red, green, blue), funs(mean)) %>%
    ungroup() %>%
    tidyr::gather(rgb, value, red, green, blue) %>%
    group_by(name, league) %>%
    summarize_at(vars(value), funs(mean, sd)) %>%
    ungroup() %>%
    arrange(league, mean)

  colors_rgb_bylg <- colors_rgb_bynm_bylg %>%
    group_by(league) %>%
    summarize_at(vars(mean, sd), funs(mean)) %>%
    ungroup() %>%
    arrange(mean)
  colors_rgb_bylg
}

convert_dec2pct <- function(x) {
  100 * round(x, 4)
}

colors_tidy_nona <- colors_tidy %>% filter(!is.na(hex))
colors_tidy_nona %>%
  rank_leagues_byrgb() %>%
  arrange(mean) %>%
  mutate_if(is.numeric, funs(convert_dec2pct)) %>%
  create_kable()
league | mean | sd |
---|---|---|
nhl | 30.46 | 14.52 |
nfl | 32.90 | 12.68 |
mlb | 33.75 | 13.16 |
epl | 36.56 | 15.79 |
mls | 38.59 | 12.05 |
nba | 40.99 | 10.37 |
This calculation supports what we might have guessed by inspection–the NHL has the darkest colors. In fact, it seems that the NHL’s “darkness” is most prominent in the primary colors of the teams in the league.
colors_tidy_nona %>%
  filter(ord == "primary") %>%
  rank_leagues_byrgb() %>%
  arrange(mean) %>%
  mutate_if(is.numeric, funs(convert_dec2pct)) %>%
  create_kable()
league | mean | sd |
---|---|---|
nhl | 9.13 | 9.20 |
nfl | 23.42 | 18.69 |
mlb | 29.90 | 29.00 |
mls | 32.16 | 22.74 |
epl | 37.35 | 29.10 |
nba | 37.95 | 30.10 |
On the other hand, the NBA and the two soccer leagues (the MLS and the EPL) stand out as the leagues with the “brightest” colors.
Finally, just by inspection, it seems like there is an unusual pattern where a disproportionate number of teams in the MLS, NBA, and NFL have shades of gray as their tertiary colors. Using the same function as before, it can be shown indirectly via relatively small standard deviation values that there is not much variation in this color.
colors_tidy_nona %>%
  filter(ord == "tertiary") %>%
  rank_leagues_byrgb() %>%
  arrange(sd) %>%
  mutate_if(is.numeric, funs(convert_dec2pct)) %>%
  create_kable()
league | mean | sd |
---|---|---|
nba | 34.23 | 9.39 |
nfl | 36.98 | 12.48 |
mls | 48.81 | 14.13 |
epl | 49.57 | 23.89 |
nhl | 42.10 | 28.15 |
mlb | 56.64 | 29.93 |
Common Colors
Using a slightly customized version of the plotrix::color.id() function, I can attempt to identify common colors (by name) from the hex values.
# Reference: plotrix::color.id
color_id <- function(hex, set = grDevices::colors()) {
  c2 <- grDevices::col2rgb(hex)
  coltab <- grDevices::col2rgb(set)
  cdist <- apply(coltab, 2, function(z) sum((z - c2)^2))
  set[which(cdist == min(cdist))]
}

identify_color_name <- function(col = NULL, set = grDevices::colors()) {
  col %>%
    # purrr::map(plotrix::color.id) %>%
    purrr::map(~color_id(.x, set)) %>%
    purrr::map_chr(~.[1]) %>%
    stringr::str_replace_all("[0-9]", "")
}
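The idea behind this lookup is a nearest-neighbor search in RGB space: each candidate name is converted to RGB, and the one with the smallest squared Euclidean distance to the target color wins. The following base R sketch illustrates that idea on a tiny candidate set. The function `nearest_color()` and the four-color `example_set` are hypothetical and chosen only for illustration.

```r
# Hypothetical stand-in for the nearest-named-color lookup; it returns the
# first candidate with the smallest squared RGB distance to one hex color.
nearest_color <- function(hex, set) {
  target <- grDevices::col2rgb(hex)      # 3 x 1 matrix for a single color
  candidates <- grDevices::col2rgb(set)  # 3 x length(set) matrix
  # Subtracting a length-3 vector recycles it down every column.
  sq_dist <- colSums((candidates - as.vector(target))^2)
  set[which.min(sq_dist)]
}

example_set <- c("red", "blue", "black", "white")
bournemouth <- nearest_color("#e62333", example_set)  # a strong red
ducks <- nearest_color("#010101", example_set)        # almost pure black
print(c(bournemouth, ducks))
```

With a small candidate set like this, many visually different hex values collapse onto the same name, which is the binning effect that the predefined set is meant to produce.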
I’ll bin the possible colors into a predefined set. (If a binning strategy is not implemented, one ends up with a more sparse, less meaningful grouping of colors.) This set consists of the “rainbow” colors, as well as black, white, and two shades of grey.
colors_rnbw_hex <- c(
  stringr::str_replace_all(grDevices::rainbow(16), "FF$", ""),
  "#FFFFFF",
  "#EEEEEE",
  "#AAAAAA",
  "#000000"
)
colors_rnbw <- identify_color_name(colors_rnbw_hex)
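One caveat about stripping the alpha suffix with a pattern like "FF$": as far as I can tell, whether grDevices::rainbow() appends an alpha channel to its hex strings depends on the R version, and when it does not, the pattern would instead trim colors whose blue channel happens to be FF. A possibly more robust sketch, assuming only that the first seven characters are always "#RRGGBB", is to keep a fixed-width prefix. The name `rainbow_hex` is hypothetical.

```r
# Keep only "#RRGGBB" regardless of whether an alpha suffix is present.
rainbow_hex <- substr(grDevices::rainbow(16), 1, 7)
print(rainbow_hex)
```

This should give the same 16 colors as the pattern-based approach on versions that do append the alpha suffix.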
Now, with the setup out of the way, I can easily compute the names of each color and identify the most common colors overall, as well as the most common primary and secondary colors.
add_color_nm_col <- function(data, rename = TRUE) {
  out <- data %>%
    pull(hex) %>%
    identify_color_name(set = colors_rnbw) %>%
    tibble::as_tibble() %>%
    bind_cols(data, .)
  if(rename) {
    out <- out %>% rename(color_nm = value)
  }
  out
}

colors_named <- colors_tidy_nona %>% add_color_nm_col()
colors_named %>%
  count(color_nm, sort = TRUE) %>%
  create_kable()
color_nm | n |
---|---|
black | 173 |
red | 61 |
darkgray | 55 |
yellow | 43 |
gray | 36 |
darkgoldenrod | 35 |
orangered | 33 |
blue | 26 |
deepskyblue | 18 |
mediumspringgreen | 4 |

(Total rows: 15)
colors_named %>%
  count(ord, color_nm, sort = TRUE) %>%
  filter(ord %in% c("primary", "secondary")) %>%
  group_by(ord) %>%
  mutate(rank_byord = row_number(desc(n))) %>%
  do(head(., 5)) %>%
  create_kable()
ord | color_nm | n | rank_byord |
---|---|---|---|
primary | black | 75 | 1 |
primary | red | 26 | 2 |
primary | blue | 16 | 3 |
primary | orangered | 14 | 4 |
primary | deepskyblue | 6 | 5 |
secondary | black | 46 | 1 |
secondary | red | 19 | 2 |
secondary | yellow | 19 | 3 |
secondary | darkgray | 17 | 4 |
secondary | gray | 12 | 5 |
Of course, a visualization is always appreciated.
ords <- ord_nums %>% pull(ord)
color_nm_na <- "none"
colors_named_compl <- colors_named %>%
  mutate(ord = factor(ord, levels = ords)) %>%
  select(-hex, -league) %>%
  tidyr::complete(name, ord, fill = list(color_nm = color_nm_na)) %>%
  tidyr::spread(ord, color_nm)

colors_named_compl_ord2 <- colors_named_compl %>%
  filter(primary != secondary) %>%
  count(primary, secondary, sort = TRUE) %>%
  filter(primary != "none") %>%
  filter(secondary != "none")

colors_named_compl_ord2_ig <- colors_named_compl_ord2 %>%
  igraph::graph_from_data_frame()
igraph::V(colors_named_compl_ord2_ig)$node_label <-
  names(igraph::V(colors_named_compl_ord2_ig))
igraph::V(colors_named_compl_ord2_ig)$node_size <-
  igraph::degree(colors_named_compl_ord2_ig)

lab_title_colors_named <- paste0("Colors", lab_base_suffix)
lab_subtitle_colors_named <-
  paste0("Relationships Among Primary and Secondary Colors")

# Reference: https://rud.is/books/21-recipes/visualizing-a-graph-of-retweet-relationships.html.
viz_colors_named <- colors_named_compl_ord2_ig %>%
  ggraph::ggraph(layout = "linear", circular = TRUE) +
  # ggraph::ggraph(layout = "kk") +
  # ggraph::ggraph() +
  # ggraph::geom_edge_arc() +
  ggraph::geom_edge_arc(
    aes(
      edge_width = n / 3,
      # edge_alpha = n
      start_cap = ggraph::label_rect(node1.name, padding = margin(5, 5, 5, 5)),
      end_cap = ggraph::label_rect(node2.name, padding = margin(5, 5, 5, 5))
    )
  ) +
  ggraph::geom_node_text(aes(label = node_label)) +
  coord_fixed() +
  # teplot::theme_te()
  # ggraph::theme_graph(base_family = "Arial") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12)
  ) +
  labs(title = lab_title_colors_named, subtitle = lab_subtitle_colors_named)
# viz_colors_named
Additionally, given the “set” nature of the data set, I think that the {UpSetR} package can be used to create an intersection-style graph. 6
Neglecting the color black, which is unsurprisingly the most common color, red has the highest count. (Consequently, it is deserving of use as the fill for the bars in the following plot.) 7
On the other hand, it’s a bit surprising to me that neither blue nor its brethren, cyan and deepskyblue, is among the top 2 or 3. One might argue that the three shades of blue inherently cause classification to be “less focused”, but this does not seem to curb the prominence of red, which also has two sister colors in orangered and deeppink.
Color Clustering
Aside from set analysis, this data set seems prime for some unsupervised learning, and, more specifically, clustering. While analysis of the colors using RGB values as features can be done (and is actually what I tried initially), the results are not as interpretable as I would like them to be due to the “>2-dimensionality” nature of such an approach.
Thus, as an alternative to RGB components, I determined that “hue” serves as a reasonable all-in-one measure of the “essence” of a color. It is inherently a radial feature–its value can range from 0 to 360 (where red is 0, green is 120, and blue is 240). 8
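For readers who want to see how a hue in degrees can be derived from a hex value, here is a minimal base R sketch. It is not the conversion code used for this write-up: the helpers `hex_hue()` and `hue_distance()` are hypothetical, and the sketch assumes that `grDevices::rgb2hsv()` reports hue on a 0 to 1 scale, so the value is multiplied by 360.

```r
# Hypothetical helper: hue in degrees (0 to 360) for each hex color.
hex_hue <- function(hex) {
  hsv_mat <- grDevices::rgb2hsv(grDevices::col2rgb(hex))  # rows h, s, v
  unname(hsv_mat["h", ]) * 360
}

# Hypothetical helper: shortest angular distance between two hues, which
# respects the fact that 0 and 360 describe the same color.
hue_distance <- function(a, b) {
  d <- abs(a - b) %% 360
  pmin(d, 360 - d)
}

hues <- hex_hue(c("#FF0000", "#00FF00", "#0000FF"))
print(hues)
print(hue_distance(350, 10))
```

The second helper is only there to illustrate the "radial" point: two reds at 350 and 10 degrees are 20 degrees apart on the color wheel, even though their raw values differ by 340.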
Then, by limiting the color sets to just the primary and secondary colors (such that there are only 2 features), I create a setting that allows the clustering results to be interpreted (and visualized) in a relatively direct manner.
With my setup decided upon, I implement a tidy pipeline for statistical analysis–making heavy use of David Robinson’s {broom} package–to explore various values of k for a kmeans model. (The package’s kmeans vignette provides a really helpful example.)
While this visualization is fairly informative, it doesn’t pinpoint which value of k is optimal. There are [various methods for determining the optimal k-value for a kmeans model](http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/), one of which is the “Elbow” method. Basically, the point is to plot the within-cluster sum of squares (WSS) (i.e. variance) for each value of k–which is typically monotonically decreasing with increasing k–and pick the value of k that corresponds to the “bend” in the plot.
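The WSS-versus-k curve behind the Elbow method can be sketched in base R as follows. This is not the {broom}-based pipeline used for this write-up, and the data are synthetic: the matrix `toy_hues`, its three made-up cluster centers, and the seed are all assumptions for illustration.

```r
set.seed(42)

# Synthetic stand-in data: three loose groups of (primary, secondary) hues.
toy_centers <- rbind(c(30, 200), c(120, 300), c(240, 60))
toy_hues <- do.call(rbind, lapply(1:3, function(i) {
  cbind(
    primary = rnorm(40, mean = toy_centers[i, 1], sd = 10),
    secondary = rnorm(40, mean = toy_centers[i, 2], sd = 10)
  )
}))

# Fit kmeans for k = 1 to 8 and record the total within-cluster sum of
# squares; nstart > 1 reduces the chance of a poor random start.
ks <- 1:8
wss <- sapply(ks, function(k) {
  stats::kmeans(toy_hues, centers = k, nstart = 20)$tot.withinss
})
print(data.frame(k = ks, wss = round(wss)))
```

On data like this, the WSS should drop sharply up to the true number of groups and flatten afterwards, and that flattening is the "bend" the Elbow method looks for.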
For those who enjoy calculus, the k value for which the second derivative of the curve is minimized is the optimal value (by the Elbow method).
kms_metrics$tot.withinss[1:8] %>%
  diff(differences = 2) %>%
  which.min() + 2
## [1] 7
To complement this conclusion, I can use the fviz_nbclust() function from the factoextra package. It deduces the optimal k value by the consensus of different methods.
It’s nice to see that this method comes to the same conclusion.
I’ll continue this analysis in a separate write-up. Unfortunately (or, perhaps, fortunately) there was too much to fit in a single document without it feeling overwhelming.
1. The fact that the data comes in an easy-to-work-with format comes as a relief to those of us used to having to clean raw data tediously.
2. Note that I use the name ord to represent “ordinality” of the color–that is, primary, secondary, tertiary, or quaternary.
3. In fact, the current consensus among sports fans is that the MLB has a decaying fan-base in the U.S. because it is failing to attract younger fans. This opinion is typically based on conjectures about the game’s slow pace, but, who knows, maybe colors also have something to do with it! (I’m only kidding. I pride myself on guarding against the correlation-equals-causation fallacy.)
4. Again, in case you think I’m serious, let me be clear–yes, it is most likely a coincidence.
5. This probably should be a footnote :).
6. After learning about this package recently, I’m glad to finally have a use-case for it!
7. Unfortunately, it seems that customizing the colors for each set is not straightforward, so I do not attempt it.
8. Red is also 360.