Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I recently came across the article “How Mapping Shots In The NBA Changed It Forever” and although I am not a big basketball fan, I was impressed by the visualizations. I actually bought the book “Sprawlball” by Kirk Goldsberry afterwards, where this was taken from. I can only recommend it, even if you are not a huge basketball fan since it contains a lot of inspirational dataviz material.
In this post, I am going to try to replicate the first three visualizations of the linked article, which turned out to be harder than expected! Additionally, I will use the rayshader
package to turn one of the figures into a
nice 3D plot.
library(tidyverse) library(showtext) library(hexbin)
Data
When I decided to do this replication, I was pretty sure that it is doomed to failure since I will
not get the right data (precise shot locations). Turns out, stats.nba.com has an incredible rich (but undocumented)
API. I will not go into detail on how I got all the relevant data, but if you want to get it yourself,
check out toddwschneider’s R package ballr
, specifically this function.
I took several hours, but at the end I managed to put together a dataset of all shots in the NBA between 2006 and 2019. In the following, we will only need 2013-2018 though.
Scatterplot of shot locations
The first figure in the article is the easiest to reproduce, since it is a simple
scatterplot. The only thing that took a bit of time was to find a suitable . In the end, I settled
with Rokkitt from Google, which can be loaded via the showtext
package as follows.
_add_google("Rokkitt", "rokkitt") showtext_auto()
A second issue was to get the court in the background. The ballr
package luckily has a plot_court
function
which I tweaked slightly for my purposes.
col_scheme <- c("firebrick4","darkorange2","gold1", "chartreuse4","cadetblue2","darkorchid4") p <- plot_court()+ geom_point(data = scatter_shots, aes(loc_x,loc_y,col = shot_distance),size = 0.1)+ scale_color_gradientn(colours = col_scheme,breaks = c(0,30),limits = c(0,30), name = "SHOT DISTANCE IN FEET")+ guides(colour = guide_colourbar(barwidth = 11, barheight = 0.8,ticks = F, title.position = "top",title.hjust = 0.5))+ labs(title = "NBA SHOT LOCATIONS\n2014-15")+ theme(plot.title = element_text(hjust=0.5,size = 32,colour="grey45", family="rokkitt",face = "bold"), legend.position = "bottom",legend.justification = c(0.88,0), legend.text = element_text(size=14,colour = "grey45",family = "rokkitt"), legend.title = element_text(size=13,colour = "grey45",family = "rokkitt"), legend.box.margin = margin(c(-50,0,3,0)), plot.margin = margin(3,3,3,3, unit = "pt")) ggsave("shots2014-15.png",plot = p ,width = 10, height = 8, dpi = 100)
Looks pretty close to the original. Although it was quite straightforward to reproduce this figure, I still learned two things: How to tweak a colourbar legend with guide_colourbar
and how to use the legend.justification argument of theme
.
Field goal percentages
Reproducing the second figure of the article is way more involved than the first (at least it was for me). The figure has a lot of very smooth areas and I would assume that it is more of an illustration than an accurate depiction of the data (which is of course totally fine in this context!). I tried out several things, and in the end I settled for a combination of “hexbin aggregation” and interpolating the data on a regular grid.
The first step was to aggregate shots in roughly the same location into cells and calculate the field goal percentages
for each cell. I did so using the hexbin
function of the package with the same name.
tbl <- vroom::vroom(c("shots2013-14.csv","shots2014-15.csv","shots2015-16.csv", "shots2016-17.csv","shots2017-18.csv")) %>% select(loc_x,loc_y,shot_made_numeric,shot_value,shot_distance) bns <- 20 hb <- hexbin(x = tbl$loc_x,y = tbl$loc_y,xbins = bns,IDs = TRUE) shot_perc <- tbl %>% mutate(hex = hb@cID) %>% group_by(hex) %>% dplyr::summarise(perc = sum(shot_made_numeric)/n()) shot_perc <- left_join(shot_perc,tibble(hex = hb@cell,x = hb@xcm,y = hb@ycm),by = "hex")
I played around with the number of bins and found that 20 seemed to be a good trade-off.
Afterwards, I created a regular grid with 1000 points per dimension and interpolated the
shot percentages with the interp
function from the akima
package.
npts <- 1000 xyz_interp <- akima::interp(x = shot_perc$x,y = shot_perc$y, z = shot_perc$perc,nx = npts,ny = npts,linear = T) shot_perc_plot <- tibble(x = rep(xyz_interp[[1]],npts), y = rep(xyz_interp[[2]],each=npts), z = c(xyz_interp[[3]])) %>% dplyr::filter(!is.na(z) & y >= -2) %>% mutate(z=if_else(y < 0 & z <= 0.35,0.36,z)) %>% mutate(z=if_else(y < 0 & z >= 0.45 & x > 5,0.45,z)) %>% mutate(zfill=case_when(z <= 0.35~0.35, z <= 0.4~0.40, z <= 0.45~0.45, z <= 0.50~0.50, z <= 0.55~0.55, z <= 0.60~0.6, TRUE~0.65)) %>% mutate(zfill = as.factor(zfill)) %>% dplyr::filter(sqrt(x^2+(y-hoop_center_y)^2)<=27)
shot_perc_plot
now contains (interpolated) field goal percentages on a regular grid
for the whole court. Visualizing this is now relatively straightforward with a combination
of geom_tile
, geom_contour
and some annotation sugar.
p <- ggplot(shot_perc_plot)+ geom_tile(aes(x,y,fill=zfill))+ geom_contour(aes(x,y,z=z), binwidth = 0.03,col="white",size=0.2,linetype="dashed")+ geom_path(data = court_points, aes(x = x, y = y-5, group = desc), color = "black",size=0.8 ) + geom_hline(yintercept = -5,size=0.8)+ annotate("text",x = 0,y = 15,size = 12, label = "40-45%",colour = "#8b5122",family = "rokkitt")+ annotate("text",x = -15,y = 24,size = 12, label = "30-35%",colour = "#f6eac6",family = "rokkitt")+ annotate("text",x = -23,y = 5,size = 12, label = "35-40%",colour = "white",family = "rokkitt",angle = 90)+ annotate("text",x = 23,y = 3,size = 12, label = "35-40%",colour = "white",family = "rokkitt",angle = 270)+ annotate("text",x = 11,y = 2,size = 12, label = "35-40%",colour = "white",family = "rokkitt")+ scale_fill_manual(values = c("#8b5122", "#d8b36b", "#f6eac6", "#FFFFFF", "#c9e8e5", "#5ab4ac", "#0f665e"), name="FIELD GOAL PERCENTAGE BY AREA", labels=c("30%","35%","40%","45%","50%","55%","60%"))+ guides(fill = guide_legend(nrow = 1,title.position = "top",title.hjust = 0.5, label.position = "bottom"))+ coord_fixed(ylim = c(-5,30),xlim=c(-22,22))+ labs(title="FIELD GOAL PERCENTAGES\n2013-14 TO 2017-18")+ theme_minimal(base_size = 22) + theme(plot.title = element_text(hjust = 0.5,size = 48,colour = "grey45", family = "rokkitt",face = "bold",lineheight = 0.6), legend.position = "bottom",legend.justification = c(0.98,0), legend.text = element_text(size = 20,colour = "grey45", family = "rokkitt",angle = 45,hjust = 0.5), legend.title = element_text(size = 22,colour = "grey45",family = "rokkitt"), legend.box.margin = margin(c(-50,0,3,0)), plot.margin = margin(3,3,3,3, unit = "pt"), plot.background = element_rect(fill = "white", color = "white"), panel.background = element_rect(fill = "white", color = "white"), panel.grid = element_blank(), panel.border = element_blank(), axis.text = element_blank(), axis.title = element_blank(), axis.ticks = element_blank()) ggsave("fgp2013-18.png",plot = p ,width = 10, height = 8, dpi = 200)
The result is not as smooth as the original but the overall picture is very close to it.
Points per shot
The third figure in the article is pretty much the same as the second. Just with different data. So we can use the same aggregation/interpolation strategy as before.
bns <- 20 hb <- hexbin(x = tbl$loc_x,y = tbl$loc_y,xbins = bns,IDs = TRUE) shot_pts <- tbl %>% mutate(hex = hb@cID) %>% group_by(hex) %>% dplyr::summarise(pts = sum(shot_made_numeric*shot_value)/n()) shot_pts <- left_join(shot_pts,tibble(hex = hb@cell,x = hb@xcm,y = hb@ycm),by="hex") npts <- 1000 xyz_interp <- akima::interp(x = shot_pts$x,y = shot_pts$y, z = shot_pts$pts,nx = npts,ny = npts,linear = T) shot_pts_plot <- tibble(x = rep(xyz_interp[[1]],npts), y = rep(xyz_interp[[2]],each=npts), z = c(xyz_interp[[3]])) %>% dplyr::filter(!is.na(z) & y>= -2) %>% mutate(z=if_else(y<1 & z>=0.85 & (between(x,5,20) | between(x,-20,-5)),0.85,z)) %>% mutate(zfill=case_when(z<=0.85~0.85, z<=0.90~0.90, z<=0.95~0.95, z<=1.00~1.00, z<=1.05~1.05, z<=1.10~1.1, z<=1.2~1.2, TRUE~1.25)) %>% mutate(zfill=as.factor(zfill)) %>% dplyr::filter(sqrt(x^2+(y-hoop_center_y)^2)<=27)
If you look closely at the third figure, you will realize that there is a strong cut between the two and three point area. So technically, we should have treated the two areas separately and combine the results in the end. However, I did not manage to do this in a satisfying way, so I stuck with treating both areas together, which will of course change the figure slightly.
p <- ggplot(shot_pts_plot)+ geom_tile(aes(x,y,fill=zfill))+ geom_contour(aes(x,y,z=z),binwidth = 0.05,col="white",size=0.2,linetype="dashed")+ geom_path(data = court_points, aes(x = x, y = y-5, group = desc), color = "black",size=0.8 ) + geom_hline(yintercept = -5,size=0.8)+ annotate("text",x=-14,y=11,size=12, label="UNDER 0.85",colour="#e8d6e8",family="rokkitt")+ annotate("text",x=14,y=11,size=12, label="UNDER 0.85",colour="#e8d6e8",family="rokkitt")+ annotate("text",x=0,y=15,size=12, label="0.85-0.9",colour="#e8d6e8",family="rokkitt")+ annotate("text",x=-27,y=31,size=9, label="BETWEEN 2013-14 AND\n2017-18, THE AVERAGE NBA\nSHOT YIELDED 1.02\nPOINTS", hjust=0,vjust=1,family="rokkitt",lineheight = 0.6)+ scale_fill_manual(values=c("#772d83","#9871ac","#c2a7cf","#e8d6e8", "#daecd3","#a8d6a2","#59ae64","#10632f"), name="AVERAGE POINTS PER FIELD GOAL ATTEMPT", labels=c("0.85","0.90","0.95","1.00","1.05","1.10","1.20",""))+ guides(fill = guide_legend(nrow = 1,title.position = "top",title.hjust = 0.5, label.position = "bottom"))+ coord_fixed(ylim = c(-5,30),xlim=c(-22,22),clip="off")+ labs(title="POINTS PER SHOT\n2013-14 TO 2017-18")+ theme_minimal(base_size = 22) + theme(plot.title = element_text(hjust=0.5,size = 48,colour="grey45", family="rokkitt",face = "bold",lineheight = 0.5), legend.position = "bottom",legend.justification = c(0.98,0), legend.text = element_text(size=20,colour="grey45", family="rokkitt",angle = 45,hjust=0.5), legend.title = element_text(size=22,colour="grey45",family="rokkitt"), legend.box.margin=margin(c(-50,0,3,0)), plot.margin = margin(3,3,3,3, unit = "pt"), plot.background = element_rect(fill = "white", color = "white"), panel.background = element_rect(fill = "white", color = "white"), panel.grid = element_blank(), panel.border = element_blank(), axis.text = element_blank(), axis.title = element_blank(), axis.ticks = element_blank()) ggsave("pps2013-18.png",plot = p ,width = 10, height = 8, dpi = 200)
Again, the result is not as smooth as the original. Especially the boundary of the two and three point area looks different. Overall though, I think the figure captures the essential result of the original.
3D plot of points per shot
Recently, I learned about the awesome rayshader
package by
tylermorganwall. Especially the relatively
new feature of turning any ggplot object into a 3D plot is fascinating! The below code
turns the points per shot figure into a 3D object and produces a short video that navigates the
plot.
plot_gg(p, width = 5, height = 5, multicore = TRUE, scale = 250, zoom = 0.7, theta = 10, phi = 30, windowsize = c(800, 800)) phivec <- 20 + 70 * 1/(1 + exp(seq(-5, 10, length.out = 180))) phivecfull <- c(phivec, rev(phivec)) thetavec <- 90 * sin(seq(0,359,length.out = 360) * pi/180) #+270 zoomvec <- 0.5 + 0.5 * 1/(1 + exp(seq(-5, 10, length.out = 180))) zoomvecfull <- c(zoomvec, rev(zoomvec)) for(i in 1:360) { render_camera(theta = thetavec[i],phi = phivecfull[i],zoom = zoomvecfull[i]) render_snapshot(paste0("anim/frame", i, ".png")) } #Run this command in the command line using ffmpeg to stitch together a video: #ffmpeg -framerate 60 -i frame%d.png -vcodec libx264 raymovie.mp4 #And run this command to convert the video to post to the web: #ffmpeg -i raymovie.mp4 -pix_fmt yuv420p -profile:v baseline -level 3 -vf scale=-2:-2 rayweb.mp4
This is the result:
< video width="450" controls> < source src="http://blog.schochastics.net/img/rayweb.mp4" type="video/mp4">R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.