Six of One (Plot), Half-Dozen of the Other
This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com. He has blogged at Bad Hessian before.

If you have a WordPress blog with the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about it, except that you don’t usually see bar charts with the bars superimposed.
I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found. (download the data).
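All of the code in this post reads the same CSV file. As a rough sketch of what that file is assumed to look like, here is a standard-library Python snippet. The column names (Month, Views, Visitors) are taken from the plotting code in this post; the month format and every number are made up for illustration, and treating a blank Visitors cell as 0 is an assumption based on the NA handling in the Julia example.

```python
import csv
import io

# Made-up rows for illustration only; the real file is the download linked above.
# Column names are inferred from the plotting code in this post.
sample_csv = """Month,Views,Visitors
2012-10,1200,
2012-11,1500,
2012-12,1800,900
2013-01,2400,1300
2013-02,3100,1700
"""

rows = []
for record in csv.DictReader(io.StringIO(sample_csv)):
    rows.append({
        "Month": record["Month"],
        "Views": int(record["Views"]),
        # Assumption: a blank Visitors cell is treated as 0
        "Visitors": int(record["Visitors"]) if record["Visitors"] else 0,
    })

# Label every fifth month (1-based position) and leave the rest blank,
# which is the same rule the R and Python plotting code applies
labels = [row["Month"] if (i + 1) % 5 == 0 else "" for i, row in enumerate(rows)]
```

The `labels` list is what ends up on the x-axis, so most of the tick labels are empty strings by design.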
R: ggplot2
Although I prefer other languages for my analytics work these days, inertia and nostalgia mean that when I think of making charts, I still think of ggplot2 and R. Creating the above chart is pretty straightforward, though I didn’t quite replicate it, as I couldn’t figure out how to stop my custom legend from doing the diagonal bar thing.
The R Cookbook talks about a hack to remove the diagonal lines from legends, so I don’t feel too bad about not getting it. I also couldn’t figure out how to force ggplot2 to give me the horizontal line at 10000. If anyone in the R community knows how to fix these, let me know!
library(ggplot2)
#Load Data
#Make sure months stay ordered - first time I ever wanted a factor!
#Label x-axis with every fifth label
visits_visitors <- read.csv("visits_visitors.csv")
visits_visitors$Month <- factor(visits_visitors$Month, levels = visits_visitors$Month, ordered = TRUE)
visits_visitors$Month_ <- ifelse(as.numeric(row.names(visits_visitors)) %% 5 == 0, as.character(visits_visitors$Month), "")
#Build plot as a series of elements
ggplot() +
  geom_bar(data=visits_visitors, aes(x= Month, y= Views, colour = "lightblue"), stat = "identity", fill = '#278DBC') +
  geom_bar(data=visits_visitors, aes(x= Month, y= Visitors, colour="navyblue"), stat="identity", fill = "navyblue", width = .6) +
  scale_y_continuous(breaks=c(0,2000,4000,6000,8000,10000)) +
  scale_x_discrete(labels=visits_visitors$Month_) +
  xlab("") +
  ylab("") +
  scale_colour_manual(name = '', values =c('lightblue'='#278DBC','navyblue'='navyblue'), labels = c('Views','Visitors'))+
  scale_fill_manual(values =c('lightblue'='#278DBC','navyblue'='navyblue'))+
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.y = element_blank(),
    panel.grid.major.y = element_line(size=.05, color="gray" ),
    panel.background = element_rect(fill='white', colour='white'),
    axis.ticks = element_blank(),
    legend.position = c(.9, 1),
    legend.direction = "horizontal"
  )
R: Base Graphics
Of course, not everyone finds ggplot2 to be easy to understand, as it requires a different way of thinking about coding than most ‘base’ R functions. To that end, there are the base graphics built into R, which produced this plot:
#Load Data
#Make sure months stay ordered - first time I ever wanted a factor!
#Label x-axis with every fifth label...need to use character(0) in base graphics, not " "?
visits_visitors <- read.csv("visits_visitors.csv")
visits_visitors$Month <- factor(visits_visitors$Month, levels = visits_visitors$Month, ordered = TRUE)
visits_visitors$Month_ <- ifelse(as.numeric(row.names(visits_visitors)) %% 5 == 0, as.character(visits_visitors$Month), character(0))
#Set up plot space for plot & add horizontal lines
barplot(visits_visitors$Views, ylim = c(0,10000), las=1, border = NA, col.axis = "darkgray", tick = FALSE)
abline(h=seq(2000,10000, 2000), col='lightgray')
#Make views barplot
barplot(visits_visitors$Views, names.arg=visits_visitors$Month_, col = "#278DBC",
        border = NA, add = TRUE, yaxt='n', ann = FALSE, col.axis = "darkgray")
#Make visitors barplot
#Width & space parameters move bars from their "center points" on the x-axis?
barplot(visits_visitors$Visitors, col = "navyblue",
        border = NA, add = TRUE, yaxt='n', xaxt= 'n', ann = FALSE,
        legend.text = c("Views", "Visitors"),
        args.legend = list(x="topright", ncol = 2, fill = c("#278DBC", "navyblue"), bty='n', border=FALSE))
Python: matplotlib
In the past year or so, there’s been quite a lot of activity towards improving the graphics capabilities in Python. Historically, there’s been a lot of teeth-gnashing about matplotlib being too low-level and hard to work with, but with enough effort, the results are quite pleasant. Unlike with ggplot2 and base R, I was able to replicate all the features of the WordPress plot:
import pandas as pd
import matplotlib.pyplot as plt
#IPython/Jupyter notebook magic; drop this line when running as a plain script
%matplotlib inline
#Read data into Python
dataset= pd.read_csv("visits_visitors.csv")
#Create every fifth month label
dataset["Month_"]= [value if (position+1) % 5 == 0 else "" for position, value in enumerate(dataset.Month)]
#Plot 1
plotviews= dataset.Views.plot(kind='bar', figsize=(17, 6), width = .9, color = '#278DBC', edgecolor= 'none', grid = False, clip_on=False)
#Plot 2 - All options here control the result plot
plotvisitors= dataset.Visitors.plot(kind='bar', figsize=(17, 6), width = .65, color = '#000099', edgecolor= 'none', grid = False, clip_on=False)
plotvisitors.set_xticklabels(dataset.Month_, rotation=0)
#Remove plot borders
for location in ['right', 'left', 'top', 'bottom']:
    plotvisitors.spines[location].set_visible(False)
#Fix grid to be horizontal lines only and behind the plots
plotvisitors.yaxis.grid(color='gray', linestyle='solid')
plotvisitors.set_axisbelow(True)
#Turn off tick marks on both axes but keep the labels
#(booleans rather than 'on'/'off' strings, which newer matplotlib versions do not accept)
plotvisitors.tick_params(axis='x',which='both', bottom=False, top=False, labelbottom=True)
plotvisitors.tick_params(axis='y',which='both', left=False, right=False, labelleft=True)
#Create proxy artist to generate legend
views= plt.Rectangle((0,0),1,1,fc="#278DBC", edgecolor = 'none')
visitors= plt.Rectangle((0,0),1,1,fc='#000099', edgecolor = 'none')
l= plt.legend([views, visitors], ['Views', 'Visitors'], loc=1, ncol = 2)
l.draw_frame(False)
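For readers who want to try the overlay trick without the CSV file or pandas, here is a minimal self-contained sketch using matplotlib's Axes API directly. It is an alternative arrangement of the same idea (a wide bar series with a narrower one drawn on top), not the code used to produce the plot above. The month names and all numbers are made up for illustration, and the Agg backend is only there so the sketch runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up numbers for illustration only; not the blog's real traffic data
months = ["M%02d" % i for i in range(1, 11)]
views = [1200, 1500, 1800, 2400, 3100, 2900, 3600, 4200, 5000, 5600]
visitors = [0, 0, 0, 900, 1300, 1200, 1700, 2000, 2600, 3000]
positions = list(range(len(months)))

fig, ax = plt.subplots(figsize=(10, 4))
# Draw the wide series first, then the narrow one on top at the same x positions
wide = ax.bar(positions, views, width=0.9, color="#278DBC", label="Views")
narrow = ax.bar(positions, visitors, width=0.65, color="#000099", label="Visitors")

# Label every fifth bar (1-based position) and leave the rest blank
labels = [m if (i + 1) % 5 == 0 else "" for i, m in enumerate(months)]
ax.set_xticks(positions)
ax.set_xticklabels(labels)

# Remove the frame, keep horizontal grid lines only, and draw them behind the bars
for side in ("top", "right", "left", "bottom"):
    ax.spines[side].set_visible(False)
ax.yaxis.grid(True, color="gray", linestyle="solid")
ax.set_axisbelow(True)
ax.tick_params(axis="both", which="both", length=0)  # hide the tick marks

# Bars carry their own labels here, so no proxy artists are needed for the legend
ax.legend(loc="upper right", ncol=2, frameon=False)

# Render to an in-memory PNG; swap in a filename to write the image to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Because each `ax.bar` call is given a `label`, the legend picks up the two series on its own, which sidesteps the proxy-artist step needed with the pandas plotting calls.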
Python: Seaborn
One of the aforementioned improvements to matplotlib is Seaborn, which promises to be a higher-level means of plotting data than matplotlib, as well as adding new plotting functionality common in statistics and research. Re-creating this plot using Seaborn is a waste of the additional functionality of Seaborn, and as such, I found it more difficult to make this plot using Seaborn than I did with matplotlib.
To replicate the plot, I ended up hacking a solution together using both Seaborn functionality and matplotlib in order to be able to set bar width and to create the legend, which defeats the purpose of using Seaborn in the first place.
import seaborn as sns
import matplotlib.pyplot as plt
#Set theme and size, remove axis titles
sns.set_style("whitegrid")
sns.set_context({"figure.figsize": (17, 6)}) #Passing in matplotlib commands directly
#Plot data
#Use dataset.index as my x_order, to ensure proper ordering by using numeric series
sns_views = sns.barplot(dataset.Month, "Views", data = dataset, color = "#278DBC", dropna = False, ci=None, x_order=dataset.Month)
#Hack to use matplotlib directly to set bar width, since Seaborn doesn't currently allow for setting bar width
plotvisitors2= dataset.Visitors.plot(kind='bar', width = .5, color = '#000099', clip_on=False, grid=False)
plotvisitors2.yaxis.grid(color='gray', linestyle='solid')
plotvisitors2.set_axisbelow(True)
#Remove borders around graph, remove axis labels
#Needs to be called after plot generated or it doesn't work
sns.despine(left=True)
sns.axlabel("", "")
#Set custom labels for chart
sns_views.set_xticklabels(dataset.Month_, rotation=0)
#Create legend using same matplotlib code from above
sns_views.legend([views, visitors], ['Views', 'Visitors'], loc=1, ncol = 2)
Julia: Gadfly
In the Julia community, Gadfly is clearly the standard for plotting graphics. Supporting d3.js, PNG, PS, and PDF, Gadfly is built to work with many popular back-end environments. I was able to replicate everything about the WordPress graph except for the legend:
using Gadfly, DataFrames
set_default_plot_size(24cm, 8.5cm)
#Read in data using DataFrames package
df = readtable("visits_visitors.csv")
#Fill :Visitors with 0 to replace NA
df[:Visitors] = array(df[:Visitors], 0)
plot(df,
     layer(x="Month", y="Visitors", Geom.bar,
           Theme(default_color=color("#000099"), bar_spacing = 3mm)),
     layer(x="Month", y="Views", Geom.bar,
           Theme(default_color=color("#278DBC"), bar_spacing = 1mm)),
     layer(yintercept=[0:2000:10000], Geom.hline, Theme(default_color=color("lightgray"))), #Make horizontal lines
     #Plot properties, not layer properties
     Guide.xlabel(""),
     Guide.ylabel(""),
     Guide.yticks(ticks=[0:2000:10000]), #Set ticks every 2000
     Guide.xticks(ticks=[0:5:30]), #Print every 5th label
     Scale.y_continuous(format=:plain), #Make y labels print normally, instead of scientific notation
     Theme(grid_color=color("white"), grid_color_focused=color("white"))) #Hide grid
Julia: Plot.ly
Plot.ly is an interesting ‘competitor’ in this challenge, as it’s not a language-specific package per se. Rather, Plot.ly is a means of specifying plots using JSON, with lightweight Julia/Python/MATLAB/R wrappers. I was able to replicate nearly everything about the WordPress plot, with three exceptions: there is no line at 10000, the legend is vertical instead of horizontal, and I couldn’t figure out how to set the bar widths separately.
using Plotly, DataFrames
Plotly.signin("username", "api-key")
#Read in data
df = readtable("visits_visitors.csv")
#Create page views plot
views = [
  ["x" => df[:Month], "y" => df[:Views], "type" => "bar", "name" => "Views", "marker" => ["color" => "rgb(39, 141, 188)"]]
]
#Create visitors plots
visitors = [
  ["x" => df[:Month], "y" => df[:Visitors], "type" => "bar", "name" => "Visitors", "marker" => ["color" => "rgb(0, 0, 153)"]]
]
layout_views = [
  "xaxis" => ["autotick" => false, "dtick" => 1, "tick0" => 1], #Seems to have no effect
  "bargap" => 10, #Seems to have no effect
]
layout_visitors = [
  "showlegend" => true,
  "legend" => ["x" => 1, "y" => 1],
  "yaxis" => ["autotick" => false, "dtick" => 2000],
  "xaxis" => ["autotick" => false, "dtick" => 5, "tick0" => 4],
  "bargap" => 0.1,
  "barmode" => "overlay"
]
#Make API calls - second call allows for overlaying graphs
response = Plotly.plot(views, ["filename" => "basic-bar", "fileopt" => "overwrite", "layout"=> layout_views])
response = Plotly.plot(visitors, ["filename" => "basic-bar", "fileopt" => "append", "layout"=> layout_visitors])
#Print inline in IJulia Notebook
s = string("<iframe height='450' id='igraph' scrolling='no' seamless='seamless' src='", response["url"],
           "/1000/450' width='1050'></iframe>")
display("text/html", s)
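To make the "plots as JSON" point concrete, here is a rough sketch in plain Python (standard library only, no Plotly wrapper and no API call) of the kind of structure the Julia dictionaries above describe. The trace and layout keys are copied from the Julia code. Grouping them under top-level "data" and "layout" keys follows Plotly's usual figure layout, but treat the exact payload the wrapper sends as an assumption. The `bar_trace` helper is hypothetical, and the months and numbers are made up for illustration.

```python
import json

# Made-up values for illustration only; not the blog's real traffic data
months = ["2013-01", "2013-02", "2013-03"]
views = [2400, 3100, 2900]
visitors = [1300, 1700, 1600]


def bar_trace(name, y, rgb):
    """Build one bar trace as a plain dict (hypothetical helper, not a Plotly function)."""
    return {"x": months, "y": y, "type": "bar", "name": name, "marker": {"color": rgb}}


figure = {
    "data": [
        bar_trace("Views", views, "rgb(39, 141, 188)"),
        bar_trace("Visitors", visitors, "rgb(0, 0, 153)"),
    ],
    "layout": {
        "showlegend": True,
        "legend": {"x": 1, "y": 1},
        "yaxis": {"autotick": False, "dtick": 2000},
        "bargap": 0.1,
        "barmode": "overlay",  # draw both series at the same x positions
    },
}

# Serialize to the JSON text that a language wrapper would ultimately produce
payload = json.dumps(figure, indent=2)
```

Seen this way, the language wrappers are thin: whichever language you start from, the plot itself is described by the same nested dictionary.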
And The Winner Is…matplotlib?!
If you had told me at the beginning of this exercise that matplotlib (and by extension, Seaborn) would be the only library with which I could replicate all the features of the WordPress graph, I wouldn’t have believed it. And yet, here we are. ggplot2 was certainly very close, and I’m certain that someone knows how to fix the diagonal line issue. I suspect I could submit an issue ticket to Gadfly.jl to get the feature added to create custom legends (and for that matter, make the request of Plot.ly for horizontal legends), so in the future there could be feature parity using these two libraries as well.
I hope we all agree there’s no hope for Base Graphics in R besides quick throwaway plots.
In the end, the best thing I can say from this exercise is that the analytics community is fortunate to have so many talented people working to provide these amazing visualization libraries. This graph was rather pedestrian in nature, so I didn’t even scratch the surface of what these various libraries can do. Beyond the six libraries I chose, there are others I didn’t, including: prettyplotlib (Python), Bokeh (Python), Vincent (Python), rCharts (R), ggvis (R), Winston (Julia), ASCII Plots (Julia) and probably even more that I’m not aware of! All free and open-source, and miles apart from terrible-looking Microsoft graphics in Excel and PowerPoint.