R Training – Data Visualization

jprimav

5 years ago

[This article was first published on R – SLOW DATA, and kindly contributed to R-bloggers].

This is the fourth module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

Graphics in data projects can be useful for several tasks including:

understand data properties
find patterns in data
communicate results

First of all, let’s load some useful packages.

library(dplyr)
library(MASS)
library(rworldmap)
library(ggplot2)
library(RColorBrewer)

Understand data properties

We will start with some exploratory graphics to summarize data and highlight broad features. This is useful to explore basic questions and hypotheses, suggest modeling strategies and so on.

x <- rnorm(100)
y <- x + rnorm(100, mean=0.2, sd=2)
df <- data.frame(lab = factor(LETTERS[1:7]), g = rgamma(7, shape = 100)) # lab must be a factor: since R 4.0 strings are no longer converted automatically

plot() is a generic function for plotting R objects. It is generic because it adapts to the class of the input provided:

if you provide a numeric vector, the default is to plot the values as points on the y axis against an integer index on the x axis
if you provide two numeric vectors, the default is to plot the points determined by the (x, y) pairs (a scatterplot)
if you provide a data frame with a factor and a numeric column, you get one boxplot per factor level (with a single value per level, as in the example here, each box collapses to a line)
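To see why plot() behaves differently for different inputs, it can help to look at the class of each object and at the S3 method registered for that class. The snippet below is a minimal sketch using only base R; the object names v and d are made up for the illustration.

```r
# Illustrative objects (names are arbitrary)
v <- rnorm(10)
d <- data.frame(lab = factor(LETTERS[1:3]), g = c(1.2, 3.4, 2.1))

class(v)  # "numeric": no dedicated method, so plot.default is used
class(d)  # "data.frame": plot.data.frame is used

# Look up the S3 methods that plot() dispatches to
m_default <- getS3method("plot", "default")
m_df      <- getS3method("plot", "data.frame")
m_factor  <- getS3method("plot", "factor")

is.function(m_df)
```

Typing methods(plot) at the console lists every class for which a plot method is currently available.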

plot(x) # values against integer index

plot(x, y) # scatterplot

plot(df) # factor against numeric: one box per level

Let’s load some example data

data("iris")
dt <- iris
str(dt) # inspect the structure of the data you have read in

summary(dt) # to obtain a summary of data

Once you know your data is clean you may want to explore some features in more detail.

dt_spec <- dt %>% group_by(Species) %>% summarise(Petal.Length=sum(Petal.Length))
plot(dt_spec)

There is also a specific function to create bar charts in R, barplot(), but the input has to be provided in a slightly different way:

barplot(dt_spec$Petal.Length, names.arg = dt_spec$Species)
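The same per-species totals can also be computed without dplyr. The sketch below uses base R's tapply(), which returns a named vector that barplot() can take directly; the name totals is just an illustrative choice.

```r
# Sum of Petal.Length within each Species, as a named numeric vector
totals <- tapply(iris$Petal.Length, iris$Species, sum)
totals

# The names of the vector are used as bar labels
barplot(totals, ylab = "Total petal length")
```

Because the labels travel with the values, there is no need for names.arg in this form.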

Looking at the summary we see that the minimum sepal length is 4.3, the maximum 7.9 and the median 5.8. The summary also reports the quartiles, but for a more thorough view of the distribution you should draw a histogram.

hist(dt$Sepal.Length)

hist(dt$Sepal.Length, nclass = 30) # more bins show finer detail (and more noise)
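The requested number of bins is only a suggestion: hist() rounds the break points to "pretty" values. A quick way to check what was actually used is to compute the histogram without drawing it, as in this small sketch (the name h is arbitrary).

```r
# Compute the histogram object without plotting it
h <- hist(iris$Sepal.Length, breaks = 30, plot = FALSE)

h$breaks       # bin boundaries actually used
h$counts       # number of observations in each bin
sum(h$counts)  # every observation falls in exactly one bin
```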

Another way to get a quick visualization of a distribution is to use boxplots.

boxplot(dt$Sepal.Length)

In this case we see clearly that:

the bulk of the distribution (the middle 50%) lies between about 5 and 6.5
the maximum value excluding outliers is somewhere between 7.5 and 8.0
the right tail is longer than the left tail

In R boxplots the box corresponds to the interquartile range (from the 25th to the 75th percentile) and the black line inside the box is the median. The lines extending vertically from the box (whiskers) show the variability outside the quartiles: by default they reach the most extreme observations within 1.5 times the interquartile range of the box. Any observations beyond the whiskers are plotted as individual points (outliers).
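The numbers behind a boxplot can be read off without drawing it. The sketch below uses boxplot.stats() from base R, which returns the five values that define the box and whiskers plus any points flagged as outliers; bs is an arbitrary name.

```r
# Five-number summary used to draw the box and whiskers (default coef = 1.5)
bs <- boxplot.stats(iris$Sepal.Length)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # observations beyond the whiskers, if any
```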

Find patterns

Usually it is a good idea to investigate relations using graphics since we are naturally prone to detect trends, relationships, etc. in a visual way.

When we talk about patterns in data we usually refer to relationships between two or more variables. Options to visualize two dimensions are:

draw multiple boxplots in one window
scatterplots
etc.

To add a 3rd dimension one option is to use different colors, shapes, sizes, etc. (rather than using 3D graphics, which are typically hard to interpret).

Say we want to see if the sepal length distribution changes according to species.

# boxplot function supports formula (~) statements
boxplot(dt$Sepal.Length ~ dt$Species, col="salmon2")

The hist() function does not support formula statements, but you can directly modify the global graphical parameters to split the graphical device into multiple slots. Before changing global parameters it is a good idea to save a copy of the original settings, so that you can easily go back to the defaults once you are done with the plot.

parOriginal <- par(no.readonly = TRUE) # save a copy of original graphical parameters
par(mfrow=c(2,2)) # par can be used to set or query graphical parameters
hist(dt[dt$Species=="setosa","Sepal.Length"], nclass = 30)
hist(dt[dt$Species=="virginica","Sepal.Length"], nclass = 30)
hist(dt[dt$Species=="versicolor","Sepal.Length"], nclass = 30)
hist(dt$Sepal.Length, nclass = 30) # full sepal length distribution

par(parOriginal) # restore the original graphical parameters
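When the layout change lives inside a function, one common pattern is to let on.exit() do the restoring, so the parameters are reset even if a plotting call fails. The helper below is a hypothetical sketch (hist_by_group is not a base R function) that relies on par() returning the previous values of whatever it changes.

```r
# Hypothetical helper: one histogram per group on a 2 x 2 grid
hist_by_group <- function(values, groups, ...) {
  old_par <- par(mfrow = c(2, 2))    # par() returns the previous settings
  on.exit(par(old_par), add = TRUE)  # restored when the function exits
  for (g in unique(as.character(groups))) {
    hist(values[groups == g], main = g, xlab = "", ...)
  }
  invisible(NULL)
}

hist_by_group(iris$Sepal.Length, iris$Species, nclass = 30)
par("mfrow")  # back to a single panel
```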

Scatterplot

Let’s simulate some numbers and draw scatterplots.

# two normal populations, with mean 2 and 4 respectively 
x_a <- rnorm(50, 2)
x_b <- rnorm(50, 4)
x <- c(x_a, x_b)

# another two normal populations respectively correlated with previous ones
y_a <- x_a + rnorm(50, 0.2, 0.5)
y_b <- x_b + rnorm(50, 0.2, 1)
y <- c(y_a, y_b)

# a variable to label the two populations
l <- c(rep("A", 50), rep("B", 50))

# a dataframe including x, y and l
df <- data.frame(x=x, y=y, l=l)

# scatterplot 2-d
plot(df$x, df$y)

# add a third dimension with colour
with(df, plot(x, y, col = factor(l))) # col needs a factor (or colors), not plain strings
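For more control over which colour each group gets, and to tell the reader which is which, a named colour vector plus a legend is one option. This is a sketch on freshly simulated data; the names pts and group_cols and the two colours are illustrative choices.

```r
set.seed(1)  # reproducible simulation

# Two labelled populations, as in the example above
pts <- data.frame(x = c(rnorm(50, 2), rnorm(50, 4)),
                  l = rep(c("A", "B"), each = 50))
pts$y <- pts$x + rnorm(100, 0.2, 0.75)

# One colour per label; indexing by label gives one colour per point
group_cols <- c(A = "steelblue", B = "tomato")
point_cols <- group_cols[pts$l]

plot(pts$x, pts$y, col = point_cols, pch = 19, xlab = "x", ylab = "y")
legend("topleft", legend = names(group_cols), col = group_cols,
       pch = 19, bty = "n")
```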

Spatial analysis

If you are interested in the visualization of a geographical attribute then a map is probably what you need. R can be used as a fast, user-friendly and extremely powerful command-line Geographic Information System (GIS).

In R there is a large and growing number of spatial data packages. Here we will focus on rworldmap, a package for visualising global data referenced by country.

The package stores multiple maps, which can be accessed through the getMap() function.

newmap <- getMap(resolution = "coarse")  
class(newmap)

Maps in R are classified as spatial (sp) objects. Spatial objects are made up of a number of different slots (that can be accessed through the @ operator):

bbox (bounding box, mostly used for setting up plots)
data (the attribute table, with one row per feature)
polygons/lines/points/… (the geometry instructing R on how to plot maps)
proj4string (define the coordinate reference system)

Inside each slot you may have multiple components which, as usual, can be accessed with the $ operator.
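The difference between @ and $ is easier to see on a toy object. The sketch below defines a made-up S4 class (ToyMap, with placeholder values) that mimics only two of the slots listed above; it is not how sp defines its classes, just an illustration of slot access.

```r
library(methods)

# A made-up S4 class with two slots, for illustration only
setClass("ToyMap", representation(bbox = "matrix", data = "data.frame"))

toy <- new("ToyMap",
           bbox = matrix(c(-180, -90, 180, 90), nrow = 2,
                         dimnames = list(c("x", "y"), c("min", "max"))),
           data = data.frame(ISO3 = c("ITA", "FRA"), value = c(1, 2)))

slotNames(toy)  # which slots the object has
toy@bbox        # a slot is accessed with @
toy@data$ISO3   # a column inside the data slot is accessed with $
```

With the real map the same pattern applies, for example newmap@data returns the attribute table.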

plot() is a generic function and it also works with spatial objects, so plot(newmap) draws the map.

To add some information to this map we need some attributes at country level. The package rworldmap itself offers some interesting environmental datasets, such as countryExData.

The package rworldmap provides a function to join country-level attributes to an internal map. All you need to do is provide the name of the column containing the join key (nameJoinColumn = "ISO3V10") and specify the type of country code that key holds (joinCode = "ISO3").

dat <- joinCountryData2Map(countryExData, joinCode = "ISO3", nameJoinColumn = "ISO3V10")

The function mapCountryData() in rworldmap draws a map of country-level data, colouring each country according to the value of the chosen column.

mapCountryData(dat, nameColumnToPlot="BIODIVERSITY")

Using spatial data in R can be challenging because there are many data types and formats, and many packages coming from diverse user communities. Nevertheless, there is an increasing trend towards harmonization and the capabilities on offer are vast. A good start is the CRAN tutorial, or one of the many tutorials on GitHub.

Communicate results

Typically the findings of a data analysis are shared with an audience, and visual aids generally help people digest complex messages. In this context sizes, shapes, widths, labels, margins, fonts, etc. all become important because they can make the visualization clearer.

Additional graphical parameters

Where applicable, plotting functions allow you to specify many additional graphical parameters. For a list of them type ?par

Let’s take the histogram created before and clean it a bit with additional graphical parameters.

hist(dt$Sepal.Length, 
     nclass = 30, # number of bins
     probability = TRUE, 
     col="wheat", # color of bars
     border = "black", # color of border of bars
     xlab = "Sepal Length", # label of x axis
     ylab = "", # label of y axis
     main = "Iris Sepal Length density distribution" # title
     ) 

fit <- fitdistr(dt$Sepal.Length, "normal") # Maximum-likelihood fitting of univariate distributions
curve(dnorm(x, mean = fit$estimate["mean"], sd = fit$estimate["sd"]), add = TRUE, col = "red") # Draws a curve corresponding to a function


# Also legends can be added

legend("topright", # position of legend box
       bty = "n", # box type = none
       legend = c("Observed", "theoretical normal"), # text to be displayed
       col = c("wheat", "red"), # colors
       lty = c(1,1), # line type 
       lwd = c(10, 1) # line width
       )
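For a normal distribution the maximum-likelihood estimates have a closed form, so the fitted values can be checked by hand. The sketch below computes them with base R only; they should agree with fit$estimate from fitdistr(), and the variable names are illustrative.

```r
v <- iris$Sepal.Length
n <- length(v)

# Maximum-likelihood estimates for a normal distribution
mu_hat    <- mean(v)
sigma_hat <- sqrt(mean((v - mu_hat)^2))  # divides by n, not n - 1

c(mean = mu_hat, sd = sigma_hat)
sd(v)  # the usual sample standard deviation is slightly larger
```

The gap between the two standard deviations is the factor sqrt((n - 1) / n), which is negligible for 150 observations.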

Ggplot

All the functions used so far belong to the base plotting system. In R there are three main plotting systems available:

base
lattice
ggplot2

ggplot2 is an implementation of the Grammar of Graphics by Leland Wilkinson (a set of principles for graphics). The grammar of graphics describes how graphics can be broken down into abstract components (much as languages are divided into nouns, adjectives, etc.). This abstraction is a very powerful way to organize all kinds of graphics and has become extremely popular in recent years.

ggplot2, like lattice, is built upon the grid package, which gives control over all the details of the graphics system in R. This is why ggplot2 allows you to produce a wide variety of visualizations for virtually every need and purpose. For the same reason ggplot2 is typically the first choice for high-quality, publication-ready work in R.

Briefly, from the ggplot book,

the grammar tells us that a statistical graphics is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system.

Another key feature of ggplot2 graphics is that they are built in layers, which explains the plus symbol (+) you will see in the code.

# on a Windows machine fonts have to be mapped to a family name before ggplot2 can use them
if (.Platform$OS.type == "windows") {
  windowsFonts(Times = windowsFont("TT Times New Roman"))
}

gg1 <- ggplot(dt, aes(x = Sepal.Length, group = Species, fill = Species)) + # Sepal.Length on the x-axis, grouped and filled (with color) by Species
  geom_density(alpha = .4) + # summarise sepal length as a semi-transparent density curve per species
  xlab("Sepal Length") + # set x-axis label
  ylab("") + # set y-axis label
  ggtitle("Sepal Length distributions by Species") + # set plot title
  guides(fill = guide_legend("Species")) + # set the title of the fill legend
  theme(plot.title = element_text(hjust = 0, vjust = 5, size = 14, family = "Times"), # set position, size and font for title
        axis.text.x = element_text(size = 12, family = "Times"), # set size and font for x-axis tick labels
        axis.text.y = element_text(size = 12, family = "Times"), # set size and font for y-axis tick labels
        panel.background = element_rect(fill = "white") # set background color
        )

gg1 # printing the object draws the plot
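The layered construction is easier to see when each step is stored separately. This is a small sketch on the iris data, with illustrative object names, where every + adds one layer to the previous object.

```r
library(ggplot2)

# Data and aesthetic mapping only: nothing is drawn yet
base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                              colour = Species))

# First layer: points
with_points <- base_plot + geom_point()

# Second layer: a straight-line fit per species
with_trend <- with_points + geom_smooth(method = "lm", se = FALSE)

with_trend  # printing the object draws all layers
```

Each intermediate object is itself a complete plot, so printing base_plot or with_points is a quick way to see what a given layer contributes.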

Colours

A careful choice of colors can help to draw better visualizations. R has 657 built-in color names. Use colors() for a list of all colors known by R.

When we need to show a range of colors we can use palettes. In the map created before the palette was not specified so mapCountryData function used its default value (in that case a heat palette, with colors ranging gradually from yellow to red). We can customize palettes to our needs.
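Base R can already build simple palettes by interpolating between named colours. The sketch below uses colorRampPalette() from grDevices to make a seven-step yellow-to-red ramp; it only resembles a heat palette and is not claimed to be the one mapCountryData uses.

```r
length(colors())  # number of built-in colour names

# colorRampPalette() returns a function; calling it with n gives n colours
heat_like <- colorRampPalette(c("yellow", "red"))(7)
heat_like

# Quick visual check of the palette
barplot(rep(1, 7), col = heat_like, border = NA, axes = FALSE)
```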

A reference package for color palettes is RColorBrewer. The function to create palettes is brewer.pal. It takes two arguments:

n: number of different colors in the palette (minimum 3, maximum depending on the palette)
name: a palette name

To have a look at all available palettes you can use:

display.brewer.all(n=NULL, type="all") # diverging, sequential, qualitative

display.brewer.all(n=NULL, type="seq") # only sequential

For an interactive viewer of palettes you can visit this page.

# using output from RColorBrewer
mapCountryData(dat, nameColumnToPlot="BIODIVERSITY",
               colourPalette = brewer.pal(7, "Purples"))

Graphical devices

Once your nice plot is completed you may want to export it for reporting purposes. There are many graphic devices in R. A graphic device is a place where you can make a plot appear:

a window on your computer (screen device)
a PDF file (file device)
a PNG or JPEG (file device)
a scalable vector graphics (SVG) file (file device)

When you make a plot in R it has to be “sent” to a specific graphic device. The most common destination is the screen. On a Mac the screen device is launched with quartz(), on Windows with windows(), on Unix/Linux with x11().

Functions like plot(), hist() and ggplot() all use the screen as the default device. If you want to send the graphics to a device other than the screen you have to:

explicitly launch a graphic device
call a plotting function to make a plot (note that if you are using a file device no plot will appear on the screen!)
annotate plot if necessary (add legends, etc.)
explicitly close the graphics device with dev.off()

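# save the ggplot as a PDF
pdf(file = "myplot.pdf")
print(gg1) # an explicit print() is needed inside functions, loops and source()d scripts
dev.off()

# save the ggplot as a PNG
png(filename = "myplot.png")
print(gg1)
dev.off()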
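The same four steps work for base graphics. The sketch below writes a histogram to a temporary PDF and then checks that the file exists; the file location and the plot contents are arbitrary choices for the illustration.

```r
out_file <- tempfile(fileext = ".pdf")

pdf(file = out_file, width = 6, height = 4)          # 1. launch the device
names(dev.cur())                                     # the active device is now "pdf"
hist(iris$Sepal.Length, main = "Sepal Length",
     xlab = "cm")                                    # 2. make the plot
abline(v = median(iris$Sepal.Length), col = "red")   # 3. annotate it
invisible(dev.off())                                 # 4. close the device

file.exists(out_file)  # the file is complete only after dev.off()
```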

R's graphical capabilities are enormous and we have only scratched the surface. To get inspired, consider taking a tour of the R graph gallery.

That’s it for this module! If you have gone through all this code you should have learnt the basics of R graphical capabilities.

The post R Training – Data Visualization appeared first on SLOW DATA.
