NYCDSA Bootcamp 1: February–April 2015, Day 3
As data scientists, we are quite familiar with finding and mucking through data. We merge, split, clean, and analyze the data in order to draw our final conclusions. Our daily workflow can feel comfortingly logical and, at times, even cut and dried. However, every now and again, we are reminded of the art of our craft. As journalism enters the age of data, it is increasingly important to present data with visual impact. Resources like the New York Times present data in a visual and even interactive way, engaging the reader and enabling self-guided exploration.
Anyone who uses R should be familiar with the graphic created by Paul Butler in 2010 and included in the original Facebook IPO back in 2012. This was brought up in one of the first classes of the NYCDSA bootcamp as an example of the prowess of ggplot2, a popular graphics package for R. This is the kind of graphic that would inspire anyone to learn more about the features of this powerful language and associated package. In particular, three things speak to me:
- There was no mapping package: the sheer quantity of connections becomes an additional layer of information, as we can clearly see most of the continents of the world mapped out by cities connected through the popular social network.
- It was done in R and ggplot2, without the aid of graphic design.
- Great circle arcs provide an intuitive and evocative feeling of international travel (think of any major airline ad or even of the old-school Indiana Jones traveling montages).
I was challenged to reproduce the look and feel of this plot using ggplot2. There is a plethora of resources online that I made use of to do this and probably many more that the reader can find if she wants to become more familiar with the capabilities of ggplot2. In particular, I draw heavily upon the tutorials by FlowingData and Spatial.ly. If you haven’t heard of them, please go check them out! They are fantastic resources and were very helpful. The rest of this post will focus on some of the elements above and how to reproduce them using R.
Reproducing a map without a map
Clearly, the Facebook universe of connections is vast enough to produce the plot above. In fact, in his original post, Butler notes his decision to plot only unique pairs of cities connected by Facebook friends rather than every connection. “A big white blob appeared in the center of the map. Some of the outer edges of the blob vaguely resembled the continents, but it was clear that I had too much data to get interesting results just by drawing lines.” Without timely access to the breadth of worldwide Facebook connections, I decided to look at US airport locations and trips originating from some of the nation’s busiest airports to their domestic destinations. More detailed code for how I wrangled the airport data can be found at the link to the RPub presentation from class.
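The unique-pairs idea can be sketched in a few lines of base R. This is only an illustration and not the wrangling code from the presentation: the `flights` data frame and its `origin` and `dest` columns are hypothetical, and treating A-to-B and B-to-A as the same pair is an assumption.

```r
# Hypothetical trip records; the real data would come from the airport data set.
flights <- data.frame(
  origin = c("JFK", "SFO", "JFK", "ORD", "LAX"),
  dest   = c("SFO", "JFK", "SFO", "LAX", "ORD"),
  stringsAsFactors = FALSE
)

# Build a direction-independent key so JFK-SFO and SFO-JFK count as one pair.
pair_key <- paste(pmin(flights$origin, flights$dest),
                  pmax(flights$origin, flights$dest), sep = "-")

# Keep only the first record seen for each unique pair.
unique_trips <- flights[!duplicated(pair_key), ]
```

Drawing one line per unique pair, rather than one per record, keeps heavily traveled routes from saturating the plot.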
Great Circles
Having loaded the data into a table of start and end coordinates for each trip, I needed to calculate a great circle arc for each trip: the shortest path between two points on the surface of a sphere, which follows the largest circle that can be drawn through both points. For this, I used the gcIntermediate function from the geosphere package, which takes two points and a number of intermediate points as inputs and returns the coordinates along the great circle between them.
gcIntermediate(nyc, sfo, n = 50, addStartEnd = TRUE)  # nyc and sfo are c(longitude, latitude) vectors
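To make the geometry concrete, here is a minimal base-R sketch of spherical interpolation between two points. It is not geosphere's implementation: the helper names `to_xyz` and `great_circle` are made up for this illustration, the Earth is treated as a perfect sphere, and the airport coordinates are approximate.

```r
# Convert a c(lon, lat) point in degrees to a unit vector on the sphere.
to_xyz <- function(p) {
  lon <- p[1] * pi / 180
  lat <- p[2] * pi / 180
  c(cos(lat) * cos(lon), cos(lat) * sin(lon), sin(lat))
}

# Return both endpoints plus n evenly spaced interior points on the great circle.
great_circle <- function(p1, p2, n = 50) {
  a <- to_xyz(p1)
  b <- to_xyz(p2)
  d <- acos(max(-1, min(1, sum(a * b))))  # angular distance in radians
  stopifnot(d > 0, d < pi)                # identical or antipodal points are not handled
  f <- seq(0, 1, length.out = n + 2)      # fraction of the way from p1 to p2
  wa <- sin((1 - f) * d) / sin(d)
  wb <- sin(f * d) / sin(d)
  xyz <- outer(wa, a) + outer(wb, b)      # one row of (x, y, z) per fraction
  data.frame(
    lon = atan2(xyz[, 2], xyz[, 1]) * 180 / pi,
    lat = atan2(xyz[, 3], sqrt(xyz[, 1]^2 + xyz[, 2]^2)) * 180 / pi
  )
}

nyc <- c(-73.78, 40.64)   # approximate JFK coordinates (lon, lat)
sfo <- c(-122.38, 37.62)  # approximate SFO coordinates (lon, lat)
arc <- great_circle(nyc, sfo, n = 50)
```

On a flat longitude/latitude plot, the resulting arc bulges north of both endpoints, which is what gives the routes their characteristic curved look.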
The gcIntermediate call would have to be repeated for every unique origin-destination pair. I used a for loop and gave each arc a path ID so that ggplot2 could draw each one as a separate line.
library(geosphere)  # provides gcIntermediate()
routes <- NULL
for (i in seq_len(nrow(trips))) {
  # 50 intermediate points plus both endpoints for trip i
  gcirc <- as.data.frame(gcIntermediate(
    c(trips$lon.origin[i], trips$lat.origin[i]),
    c(trips$lon.dest[i], trips$lat.dest[i]),
    n = 50, breakAtDateLine = FALSE, addStartEnd = TRUE))
  gcirc$pathID <- i  # allowing for group plotting in ggplot
  gcirc$iata.origin <- trips$airport1[i]
  routes <- rbind(routes, gcirc)
}
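Growing a data frame with rbind inside a loop copies it on every iteration, which can get slow for many trips. One common alternative is to build a list of per-trip data frames and bind them once at the end. The sketch below shows that pattern under stated assumptions: `interpolate` is a hypothetical straight-line stand-in so the example runs without geosphere (it does not compute a great circle), and the `trips` rows are made-up values using the column names from the loop above.

```r
# Stand-in for gcIntermediate so the sketch has no package dependencies:
# straight-line interpolation between two c(lon, lat) points (NOT a great circle).
interpolate <- function(p1, p2, n = 50) {
  f <- seq(0, 1, length.out = n + 2)
  data.frame(lon = p1[1] + f * (p2[1] - p1[1]),
             lat = p1[2] + f * (p2[2] - p1[2]))
}

# Hypothetical trips table with approximate airport coordinates.
trips <- data.frame(
  airport1   = c("JFK", "ORD"),
  lon.origin = c(-73.78, -87.90),
  lat.origin = c(40.64, 41.98),
  lon.dest   = c(-122.38, -118.41),
  lat.dest   = c(37.62, 33.94),
  stringsAsFactors = FALSE
)

# Build one data frame per trip, then bind them all in a single call.
paths <- lapply(seq_len(nrow(trips)), function(i) {
  path <- interpolate(c(trips$lon.origin[i], trips$lat.origin[i]),
                      c(trips$lon.dest[i], trips$lat.dest[i]), n = 50)
  path$pathID <- i                        # one group per trip for plotting
  path$iata.origin <- trips$airport1[i]   # used to color arcs by origin
  path
})
routes <- do.call(rbind, paths)
```

Swapping `interpolate` for a call to gcIntermediate wrapped in as.data.frame should give the same table as the loop, with the same lon, lat, pathID, and iata.origin columns.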
All that was left was to plot the results, colorizing the paths and airport locations in a night-time theme using some ggplot2 options:
library(ggplot2)
library(grid)  # for unit()
# Blank, dark canvas cropped to the continental US
p <- ggplot() +
  xlim(-130, -60) + ylim(20, 56) +
  theme(plot.margin = unit(c(-1, -1, -1, -1), "cm"),
        panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.background = element_rect(fill = "#3e4045"),
        axis.line = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none")
# Layer renamed from `airports` so the airports data frame is not overwritten
airport_points <- geom_point(data = airports[airports$country == "USA", ],
                             aes(x = long, y = lat), col = '#3d838a', size = 0.7)
# One translucent path per trip, colored by origin airport
arcs <- geom_path(data = as.data.frame(routes),
                  aes(x = lon, y = lat, group = pathID, color = iata.origin),
                  alpha = 0.2, size = 0.5)
p + airport_points + arcs  # combine the base plot with both layers (layer order assumed)
This is a small start towards the beautiful piece that Paul created. I wanted to get a little creative, so I decided to take advantage of the vector output feature from R and ggplot2. I opened up Adobe Illustrator and added a subtle glow effect to each path. I even changed some of the colors, which is something I could have easily done in ggplot2, of course.
I don’t show these plots to highlight any shortcomings of ggplot2. Rather, R output can be a beautiful thing in its own right (as Butler shows) or a great starting point for artistic effects that add impact.
Concluding Thoughts
I set out to see what would be required to achieve the neat effects in Paul Butler’s Facebook plot and covered a few of the basic elements that, when combined with larger data sets, could certainly provide a similar look and feel. While working on this project, however, I discovered a really nice blog post over at Spatial.ly, Improving R Data Visualization through Design. In it, the author provides several examples of raw R output and shows how his collaboration with a graphic designer enhanced the impact of the visual information without damaging the takeaways. While it is true that post-processing a raw plot outside of R makes the final figure harder to reproduce from code alone, I feel that the embellishments discussed in the Spatial.ly article serve a higher purpose of making these informative visuals even more memorable.