Analyzing My 2019 GitHub Usage in R
Introduction
If you are anything like me, then you probably enjoy the contribution graphs that GitHub displays on your own and others' GitHub profiles. You can see mine here. Since it is the beginning of a new year, I thought it would be fun to take a look back to see how I used GitHub in 2019 and in previous years. This is made much easier by using the gh R package, which provides an R interface to the GitHub API. This post will walk through how to get all of the commits for your personal repos (so your results will look different from mine). The gh package will use the GITHUB_PAT environment variable to access any personal access token you have previously set up. If you do not have that configured, then you can provide your token with the .token argument.
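As a rough sketch of that fallback logic, the helper below reads GITHUB_PAT and only uses an explicitly supplied token when the variable is unset. The name resolve_token is hypothetical (it is not part of gh), and the final call is left commented out because it needs network access and a real token.

```r
# Hypothetical helper (not part of gh): prefer the GITHUB_PAT environment
# variable and fall back to a token passed in by hand.
resolve_token <- function(fallback = NULL) {
  token <- Sys.getenv("GITHUB_PAT", unset = "")
  if (nzchar(token)) {
    return(token)
  }
  if (!is.null(fallback) && nzchar(fallback)) {
    return(fallback)
  }
  stop("No GitHub token found: set GITHUB_PAT or pass one explicitly.")
}

# The resolved value could then be handed to gh through its .token argument:
# me <- gh::gh("/user", .token = resolve_token())
```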
library(tidyverse)
library(gh)
library(lubridate)
library(glue)
We will also define a custom color palette based on the GitHub Style Guide to stick with the post’s theme.
gh_pal <- c(
  blue = "#0366d6",
  yellow = "#ffd33d",
  red = "#d73a49",
  green = "#28a745",
  purple = "#6f42c1",
  light_green = "#dcffe4"
)
Getting Repos
First, we want to get a listing of all of the repos that are associated with my GitHub account. We can do this using the /user/repos API endpoint.
repos <- gh("/user/repos", .limit = Inf)
We can extract the desired information from the repos object using the map_chr and map_lgl functions in the purrr package.
repo_info <- tibble(
  owner = map_chr(repos, c("owner", "login")),
  name = map_chr(repos, "name"),
  full_name = map_chr(repos, "full_name"),
  private = map_lgl(repos, "private")
)

# showing the first few non-private repos
filter(repo_info, !private)

## # A tibble: 35 x 4
##    owner      name                  full_name                       private
##    <chr>      <chr>                 <chr>                           <lgl>
##  1 saleslab   DWC                   saleslab/DWC                    FALSE
##  2 saleslab   ParsingLabSolutionsA~ saleslab/ParsingLabSolutionsAS~ FALSE
##  3 tbradley1~ adv-r                 tbradley1013/adv-r              FALSE
##  4 tbradley1~ aob-power             tbradley1013/aob-power          FALSE
##  5 tbradley1~ blog                  tbradley1013/blog               FALSE
##  6 tbradley1~ connectapi            tbradley1013/connectapi         FALSE
##  7 tbradley1~ critical-thinking-ev~ tbradley1013/critical-thinking~ FALSE
##  8 tbradley1~ dbp-calculator        tbradley1013/dbp-calculator     FALSE
##  9 tbradley1~ dragondown            tbradley1013/dragondown         FALSE
## 10 tbradley1~ dragondown_book       tbradley1013/dragondown_book    FALSE
## # ... with 25 more rows

There is a lot of other information contained within the repos API response. However, since I will be focusing mainly on my commits, I don’t need most of it for this purpose.
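To make the extraction step easier to follow without calling the API, here is a small sketch on a hand-built list that only mimics the shape of the response. The repo names and the octocat login are made up for illustration. Passing a character vector such as c("owner", "login") to map_chr walks into the nested list one level per element.

```r
library(purrr)

# Made-up stand-in for the parsed API response (illustrative values only)
mock_repos <- list(
  list(name = "blog", full_name = "octocat/blog", private = FALSE,
       owner = list(login = "octocat")),
  list(name = "notes", full_name = "octocat/notes", private = TRUE,
       owner = list(login = "octocat"))
)

# A single string extracts a top-level field from every element
repo_names <- map_chr(mock_repos, "name")

# A character vector extracts a nested field: x[["owner"]][["login"]]
owners <- map_chr(mock_repos, c("owner", "login"))

# map_lgl returns a logical vector instead of a character vector
is_private <- map_lgl(mock_repos, "private")
```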
Get and Parse Commits
In order to get all of the commits for each of these repos, we will use map_dfr to append the commit information for each repo into a single dataset. Before we do that, we need to define a few helper functions that can parse the API response.
# Convert NULL entries to NA so that map_chr() can return a character vector
null_list <- function(x){
  map_chr(x, ~{ifelse(is.null(.x), NA_character_, .x)})
}

# Pull the author, username, timestamp, and message out of a commits response
parse_commit <- function(commits, repo){
  commit_by <- map(commits, c("commit", "author", "name"))
  username <- map(commits, c("committer", "login"))
  commit_time <- map(commits, c("commit", "author", "date"))
  message <- map(commits, c("commit", "message"))

  out <- tibble(
    repo = repo,
    commit_by = null_list(commit_by),
    username = null_list(username),
    commit_time = null_list(commit_time),
    message = null_list(message)
  )

  # Parse the timestamp string (e.g. "2019-10-25T12:16:50Z") into a date-time
  out <- mutate(out, commit_time = as.POSIXct(commit_time, format = "%Y-%m-%dT%H:%M:%SZ"))

  return(out)
}

# Return NULL instead of throwing an error when a request fails
gh_safe <- purrr::possibly(gh, otherwise = NULL)
The parse_commit function will extract the desired information from each API response. The null_list function is a simple helper to convert any NULL values in the response to NA, so that the map functions don’t throw errors. Finally, the gh_safe function is a safe version of the gh function. It is defined so that if any individual request fails, the failure does not cause the entire loop to fail.
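The two safeguards described above can be tried out on toy inputs. In this sketch the commit list is made up, and risky_sqrt is a hypothetical stand-in for a request that may fail; neither comes from the GitHub API.

```r
library(purrr)

# Made-up commits: the second one has no linked GitHub account,
# so its committer field is NULL.
mock_commits <- list(
  list(committer = list(login = "a-user")),
  list(committer = NULL)
)

# Extracting a field that is missing yields NULL for that element
logins <- map(mock_commits, c("committer", "login"))

# Replace each NULL with NA so the result fits in a character vector
logins_chr <- map_chr(logins, ~ if (is.null(.x)) NA_character_ else .x)

# Hypothetical function that errors on some inputs
risky_sqrt <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

# possibly() wraps it so that an error becomes the `otherwise` value
safe_sqrt <- possibly(risky_sqrt, otherwise = NULL)

ok <- safe_sqrt(4)       # normal result
failed <- safe_sqrt(-1)  # NULL instead of an error
```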
Now we can query all of the commits from each of these repos and filter the output to include only commits that I made.
all_commits <- map_dfr(repo_info$full_name, function(z){
  # full_name has the form "owner/repo"
  name_split <- str_split(z, "/")
  owner <- name_split[[1]][1]
  repo <- name_split[[1]][2]

  repo_commits <- gh_safe("/repos/:owner/:repo/commits", owner = owner, repo = repo, .limit = Inf)

  out <- parse_commit(repo_commits, repo = z)

  return(out)
})

my_commits <- all_commits %>%
  filter(commit_by == "Tyler Bradley") %>%
  # shift the timestamps back five hours
  mutate(commit_time = commit_time - hours(5))

my_commits

## # A tibble: 3,550 x 5
##    repo     commit_by  username  commit_time         message
##    <chr>    <chr>      <chr>     <dttm>              <chr>
##  1 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 07:16:50 rendering changes
##  2 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 07:08:12 Merge branch 'master'~
##  3 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 07:08:03 changes from rendering
##  4 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 07:07:23 force style change
##  5 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 06:45:06 adding public back to~
##  6 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 06:44:44 removing public
##  7 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 06:37:32 Merge branch 'master'~
##  8 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 06:37:12 rendered on server
##  9 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 06:35:46 removed public
## 10 nmanna/~ Tyler Bra~ tbradley~ 2019-10-25 06:33:36 added post for settin~
## # ... with 3,540 more rows
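The code above shifts every timestamp by a fixed five hours. As a possible alternative, and only a sketch, lubridate can convert UTC timestamps to a named time zone so that daylight saving time is handled automatically. The zone "America/New_York" is an assumption made for this example; the post does not state which zone applies.

```r
library(lubridate)

# Two example timestamps in the format the API returns (UTC)
stamps <- c("2019-01-15T12:00:00Z", "2019-07-15T12:00:00Z")
utc_time <- ymd_hms(stamps, tz = "UTC")

# Fixed offset: the same shift in winter and summer
fixed_hours <- hour(utc_time - hours(5))

# Named zone (assumed here): the offset follows daylight saving time
local_time <- with_tz(utc_time, "America/New_York")
zone_hours <- hour(local_time)
```

With a fixed offset, both timestamps land on the same hour, while the named zone places the July timestamp one hour later. Whether that difference matters depends on how finely the hourly results are read.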
Analyzing the results
First, we can look at the overall number of commits I have made per year since I started using GitHub in 2017. Before we do that, we will add a few columns to the my_commits dataset to include grouping variables based on the commit date and time. We will also join the repo_info dataset to the my_commits dataset.
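To make the grouping variables concrete, this short sketch applies the same lubridate helpers to a single timestamp taken from the output shown earlier. Note that week() counts complete seven-day periods since January 1, which is not necessarily the same as the ISO week number.

```r
library(lubridate)

commit_time <- ymd_hms("2019-10-25 07:16:50", tz = "UTC")

commit_date <- date(commit_time)  # drops the time of day
commit_wday <- wday(commit_date)  # numeric day of week, Sunday = 1 by default
commit_year <- year(commit_date)
commit_week <- week(commit_date)  # seven-day periods since January 1
```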
my_commits <- my_commits %>%
  mutate(
    date = date(commit_time),
    wday = wday(date, label = TRUE),
    year = year(date),
    week = week(date)
  ) %>%
  left_join(
    repo_info,
    by = c("repo" = "full_name")
  ) %>%
  filter(year <= 2019)

my_commits %>%
  count(year) %>%
  ggplot(aes(as.character(year), n)) +
  geom_col(fill = gh_pal["green"]) +
  geom_text(aes(y = (n-35), label = paste(n, "commits")), color = "white", fontface = "bold") +
  theme_bw() +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2050)) +
  scale_x_discrete(expand = c(0.17, 0.17)) +
  labs(
    title = "Commits by Tyler Bradley (tbradley1013) since joining GitHub by Year",
    y = "Number of Commits"
  ) +
  theme(
    axis.title.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )
We can see that I have used GitHub more and more over the last three years, with my highest number of commits coming in 2019 (n = 1988). The same trend is true when looking at my commits to public and private repos.
my_commits %>%
  count(year, private) %>%
  mutate(
    private = ifelse(private, "Private", "Public"),
    year = factor(year, levels = unique(year)),
    # nudge the labels left or right so they sit on the dodged bars
    text_x = ifelse(private == "Private", as.numeric(year) - 0.22, as.numeric(year) + 0.22)
  ) %>%
  ggplot(aes(year, n, fill = private)) +
  geom_col(position = "dodge") +
  geom_text(aes(text_x, n-20, label = paste(n, "commits")), color = "white", fontface = "bold") +
  theme_bw() +
  scale_fill_manual(values = c(unname(gh_pal["blue"]), unname(gh_pal["green"]))) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1550)) +
  scale_x_discrete(expand = c(0.17, 0.17)) +
  labs(
    title = "Number of GitHub Commits by Tyler Bradley (tbradley1013) to Public and Private Repos by Year",
    y = "Number of Commits"
  ) +
  theme(
    axis.title.x = element_blank(),
    legend.title = element_blank()
  )
We can also use ggplot2 to recreate the GitHub contribution heatmap. It takes a minor bit of hacking to get the x-axis to show the start of each month, but it can be done like this.
my_commits %>%
  count(date, wday, year, week) %>%
  mutate(
    week = factor(week),
    wday = fct_rev(wday)
  ) %>%
  group_by(year, week) %>%
  mutate(
    min_date = floor_date(date, "week"),
    # move every week start into a shared dummy year so the facets align
    min_date = if_else(
      year(min_date) < year,
      as.Date(str_replace(min_date, as.character(year-1), "1998")),
      as.Date(str_replace(min_date, as.character(year), "1999"))
    )
  ) %>%
  ungroup() %>%
  ggplot(aes(min_date, wday, fill = n)) +
  facet_wrap(~year, ncol = 1) +
  geom_tile(width = 5, height = 0.9, color = "black") +
  theme_bw() +
  scale_y_discrete(expand = c(0,0)) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b", expand = c(0, 0)) +
  labs(
    title = "Tyler Bradley's (tbradley1013) GitHub contributions heat map by year"
  ) +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.title = element_blank(),
    legend.title = element_blank()
  ) +
  scale_fill_gradient(low = gh_pal["light_green"], high = gh_pal["green"])
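The date manipulation in the heat map code is the "hack": each commit date is floored to the start of its week, and that week start is then moved into a dummy year (1999, or 1998 when the week began in the previous calendar year) so that every facet shares one month axis. The sketch below isolates that idea in a hypothetical helper, to_dummy_year, which is written for illustration and builds the new date with paste0 rather than str_replace.

```r
library(lubridate)

# Hypothetical helper: map a date to the start of its week in a dummy year
to_dummy_year <- function(d) {
  week_start <- floor_date(d, "week")  # weeks start on Sunday by default
  # A week that began in the previous calendar year goes to 1998,
  # so it still sorts before the rest of the year.
  dummy_year <- ifelse(year(week_start) < year(d), 1998, 1999)
  as.Date(paste0(dummy_year, format(week_start, "-%m-%d")))
}

example_dates <- as.Date(c("2019-01-01", "2019-03-13"))
dummy_dates <- to_dummy_year(example_dates)
```

Because neither dummy year is a leap year, a week that starts on February 29 would not map to a valid date; that case does not arise for the years covered here.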
Now we can look at just the last year and see when I am the most productive according to commits. We can group the commits by the day of the week and the time of day to see whether any patterns emerge.
my_commits %>%
  filter(year == 2019) %>%
  mutate(
    hour = hour(commit_time)
  ) %>%
  count(hour) %>%
  mutate(
    # place labels above short bars and inside tall ones
    text_y = ifelse(n < 50, n+5, n-5),
    text_color = ifelse(n < 50, "black", "white")
  ) %>%
  ggplot(aes(hour, n)) +
  geom_col(fill = gh_pal["green"]) +
  geom_text(aes(y = text_y, color = text_color, label = paste("n =", n)), show.legend = FALSE, size = 3) +
  theme_bw() +
  scale_x_continuous(breaks = seq(0, 23, 2), labels = seq(0, 23, 2)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 310)) +
  scale_color_manual(values = c("black", "white")) +
  labs(
    title = "Commits by Tyler Bradley (tbradley1013) in 2019 by time of day",
    y = "Number of Commits",
    x = "Time of Day (Hour)"
  )
My most productive time periods are clearly between 9 am and 2 pm. Over the course of that time, it appears that I am fairly consistent in my productivity. This period of productivity corresponds to the time when I am in the office at work. The time periods right outside of that window (7am-8am and 3pm-4pm) are typically the beginning and end of my work day, so I am either getting my day started or wrapping things up at that point. It is clear from this figure that I don’t tend to do any work from 10pm-5am, which matches my sleep schedule.
Similarly, my commits by day of the week follow the expected pattern: my most productive periods are when I am at work in the office, Monday through Friday.
my_commits %>%
  filter(year == 2019) %>%
  count(wday) %>%
  ggplot(aes(wday, n)) +
  geom_col(fill = gh_pal["green"]) +
  geom_text(aes(y = n-10, label = paste("n =", n)), color = "white", show.legend = FALSE) +
  theme_bw() +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 475)) +
  scale_color_manual(values = c("black", "white")) +
  labs(
    title = "Commits by Tyler Bradley (tbradley1013) in 2019 by day of the week",
    y = "Number of Commits",
    x = "Day of the week"
  )
Conclusion
Overall, 2019 was a very productive year for me in terms of GitHub commits! At this point, I am very committed to the git/GitHub workflow and expect that my commits will continue to either follow an upward trend or reach a plateau as I continue to take on new and exciting projects at work and in school!
The gh package allows R users to easily interact with the GitHub API and analyze how they are utilizing the tools available through GitHub.