A Quick and Dirty Guide to Exploratory Data Visualization

Posted on July 20, 2013 by Cory Lesmeister in R bloggers

[This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers.]

One of the things I’ve noticed when teaching statistical fundamentals or working with colleagues is the lack of focus on first exploring the data visually. Novices seem to want to jump right in with correlations and statistical tests without getting a “feel” for what they are examining. The Germans have an appropriate term, I think, in “Fingerspitzengefühl”, which literally means “fingertips feeling”. Visualizing the data can provide this, and that is a selling point of using R. The community contributions and available packages are seemingly endless. My goal in this post is to examine, at a high level, the lattice and vcd packages. The data set is based on the world’s largest metaphor hitting an iceberg; that’s right, the Titanic.

I downloaded the data set from kaggle.com and it is part of one of their competitions. It consists of the following variables:

survived = did the passenger survive or not

pclass = passenger class

name = passenger name

sex = passenger sex

age = passenger age

sibsp = the number of siblings/spouses aboard

parch = the number of parents/children aboard

ticket = ticket number

fare = passenger fare

cabin = passenger cabin number

embarked = Port of Embarkation; either Cherbourg, Southampton or Queenstown

home.dest = passenger home and eventual destination

The contest is seeking the model that best predicts passenger survival (variable survived), and the website offers several tutorials to get contestants started. This is an interesting data set, and I think it lends itself to examples of the power of simple data visualization.

I’ve loaded the data into R, calling it titan1 and we can see it consists of 1309 observations.

> str(titan1)
'data.frame': 1309 obs. of 12 variables:
 $ survived : Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age : num 29 0.917 2 30 25 ...
 $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
 $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
 $ fare : num 211 152 152 152 152 ...
 $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
 $ ticket : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
 $ cabin : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
 $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
 $ name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..:

Notice that embarked reports 4 levels, one of which is the empty string (“”) standing in for missing data. This is a pesky problem with factors, and it took me a while to figure out how to get rid of it. I believe the easiest fix is at load time: pass stringsAsFactors = FALSE to read.csv() so the column is read as plain text rather than as a factor.
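As a rough illustration of the import options, here is a minimal sketch using a made-up four-row extract in place of the Kaggle file (the `csv_text` contents are hypothetical). It also shows `na.strings`, a standard `read.csv()` argument that the post itself does not use, which turns empty fields into NA at load time.

```r
# Hypothetical four-row extract standing in for the Kaggle CSV
csv_text <- "survived,pclass,embarked
1,1st,S
0,3rd,
1,2nd,C
0,3rd,Q"

# Reading strings as factors: the empty string becomes a level of its own
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
levels(as_factor$embarked)

# Reading strings as plain text: no factor levels to clean up,
# but the empty string is still stored as ""
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_text$embarked

# Treating empty fields as missing at import time
cleaned <- read.csv(text = csv_text, stringsAsFactors = TRUE,
                    na.strings = c("NA", ""))
levels(cleaned$embarked)
sum(is.na(cleaned$embarked))
```

With the empty string mapped to NA on the way in, no manual relabelling of factor levels should be needed afterwards.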

We will delete the missing observations, but first let’s get rid of variables I’m not interested in (ticket, cabin, home.dest and name).

> titan2 = titan1[c(-9,-10,-11,-12)] #create a subset by dropping variables
> names(titan2)
[1] "survived" "pclass" "sex" "age" "sibsp" "parch" "fare" "embarked"
> which(titan2$embarked == "") #find those pesky "" in embarked
[1] 169 285
> levels(titan2$embarked) = c(NA, "C", "Q", "S") #replace "" with NA; again, stringsAsFactors = FALSE is better during data upload
> levels(titan2$embarked)
[1] "C" "Q" "S"
> str(titan2) #confirm above
'data.frame': 1309 obs. of 8 variables:
 $ survived: Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age : num 29 0.917 2 30 25 ...
 $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
 $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
 $ fare : num 211 152 152 152 152 ...
 $ embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
> titan3 = na.omit(titan2) #delete missing observations
> str(titan3) #structure of the data ready for analysis
'data.frame': 1043 obs. of 8 variables:
 $ survived: Factor w/ 2 levels "dead","survive": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age : num 29 0.917 2 30 25 ...
 $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
 $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
 $ fare : num 211 152 152 152 152 ...
 $ embarked: Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
 - attr(*, "na.action")=Class 'omit' Named int [1:266] 16 38 41 47 60 70 71 75 81 107 ...
 .. ..- attr(*, "names")= chr [1:266] "16" "38" "41" "47" ...
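Since na.omit() removed 266 of the 1309 rows, it can be worth checking which variables carry the missing values before deleting anything. The following is a minimal sketch of the same cleaning steps on a made-up miniature of the data (the `toy` data frame and its values are hypothetical), dropping columns by name instead of by position and counting NAs per column.

```r
# Hypothetical miniature of titan1; the real data has 1309 rows
toy <- data.frame(
  survived = factor(c("survive", "dead", "dead", "survive")),
  age      = c(29, NA, 2, 30),
  fare     = c(211, 152, NA, 152),
  embarked = factor(c("S", "", "C", "S")),
  ticket   = c("a", "b", "c", "d"),
  name     = c("w", "x", "y", "z")
)

# Drop unwanted columns by name, which survives a change in column order
keep <- setdiff(names(toy), c("ticket", "cabin", "home.dest", "name"))
toy2 <- toy[keep]

# Recode "" as NA, then drop the now-unused "" level
toy2$embarked[toy2$embarked == ""] <- NA
toy2$embarked <- droplevels(toy2$embarked)

# Missing values per column, to see which variable drives the row loss
colSums(is.na(toy2))

# Listwise deletion, as in the post
toy3 <- na.omit(toy2)
nrow(toy3)
```

If one variable accounts for nearly all of the missing values, that is worth knowing, because listwise deletion then discards complete information on every other variable in those rows.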

OK, I shall now put the lattice package through its paces, putting together a number of trellis plots.

> library(lattice) #load the package

> trellis.device() #opens a graphics device with trellis settings; optional, since lattice opens one on demand

Trellis plots via lattice are an effective way to display multivariate data. The code follows the format of…graphtype(formula, data=…).
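To make the formula pattern concrete, here is a minimal sketch of the three formula shapes used most often in lattice. The `toy` data frame is hypothetical; it only borrows the post's column names so the calls resemble the ones that follow.

```r
library(lattice)

# Hypothetical toy data using the post's column names
toy <- data.frame(
  age      = rep(c(5, 20, 35, 50, 65), times = 12),
  survived = factor(rep(c("dead", "survive"), times = 30)),
  pclass   = factor(rep(c("1st", "2nd", "3rd"), each = 20))
)

# y ~ x      : a single panel
p1 <- bwplot(age ~ survived, data = toy)

# y ~ x | g  : one panel per level of the conditioning variable g
p2 <- bwplot(age ~ survived | pclass, data = toy, layout = c(3, 1))

# ~ x | g    : a univariate plot, again one panel per level of g
p3 <- histogram(~ age | survived, data = toy)

# lattice functions return "trellis" objects; printing one draws it
print(p2)
```

At the console the print step happens automatically, but inside a function, loop, or sourced script the object has to be printed explicitly for the plot to appear.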

Let’s do a simple boxplot of age by passenger survival. Note that boxplot() is base graphics; the lattice equivalent is bwplot(), which I use further down.

> boxplot(age~survived, ylab="Passenger Age", data=titan3)

> xyplot(pclass~age | survived, data=titan3) #this plot looks at class by age "conditioned" on survival

Looks like the youth in 1st and 2nd class stood a much better chance of survival than those in 3rd class.

> dotplot(survived~age | pclass, data=titan3) #trying dotplot to examine this in a different way

After much trial and error I find the next plot to be informative. Females in 1st and 2nd class seemed to have a much better chance of survival than any other group. Of the males, only 1st and 2nd class youth stood much of a chance.

> xyplot(age~survived | sex * pclass, data=titan3)

> bwplot(age~survived | pclass, data=titan3, layout=c(3,1)) #apparent visual confirmation; you could condition by sex also
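Visual impressions like these can be checked against a simple cross-tabulation. The sketch below uses a made-up eight-row data frame (the `toy` values are hypothetical and are not real Titanic rates) to show one way of computing survival proportions within each sex and class combination.

```r
# Hypothetical toy data; the proportions below are NOT real Titanic rates
toy <- data.frame(
  survived = factor(c("survive", "survive", "dead", "survive",
                      "dead", "dead", "dead", "survive")),
  sex      = factor(c("female", "female", "female", "male",
                      "male", "male", "male", "female")),
  pclass   = factor(c("1st", "2nd", "3rd", "1st",
                      "2nd", "3rd", "3rd", "3rd"))
)

# Three-way table of counts
counts <- xtabs(~ sex + pclass + survived, data = toy)

# Proportions within each sex/class cell (they sum to 1 across survived)
rates <- prop.table(counts, margin = c(1, 2))

# Survival share by sex (rows) and class (columns)
round(rates[, , "survive"], 2)

# Flat view of the raw counts
ftable(counts)
```

Applied to titan3, the same calls would put numbers behind the pattern seen in the conditioned plots.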

I really like mosaic plots (a leftover from my JMP days) for looking at nominal data. You can use what comes with base R.

> mosaicplot(~survived + pclass, data=titan3, color=TRUE) #2 factors examined in a mosaic plot; you can change colors e.g. color=3:4 etc.
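mosaicplot() also accepts a contingency table directly. As a self-contained sketch, the example below uses the `Titanic` table that ships with R in the datasets package. This is an assumption of convenience: it is a different source from the Kaggle file, includes the crew, and is already aggregated into counts.

```r
# Built-in 4-way table of counts: Class x Sex x Age x Survived (crew included)
# Collapse it to a 2-way table, Survived by Class
tab <- margin.table(Titanic, c(4, 1))
tab

# Same kind of plot as in the post, with explicit colours
mosaicplot(tab, color = 3:4, main = "Survival by class (datasets::Titanic)")
```

The area of each tile is proportional to the count in that cell, so class sizes and survival shares can be read off the same picture.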

> library(vcd) #trying something new in mosaic plots by using the vcd package

> mosaic(~ survived + sex | pclass, data=titan3, main = "Titanic Survival", shade = TRUE, legend = TRUE) #plot conditioned by pclass; much better, eh?

> spine(survived~age, data=titan3, breaks=8) #spine chart in vcd; 1 factor and 1 numeric variable broken into 8 intervals with the breaks argument
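To see roughly what a spine chart is summarising, the binning can be reproduced by hand in base R. This is a sketch under two assumptions: the `age` and `survived` vectors are made up, and the breaks are chosen the way hist() chooses them, which as I understand the documentation is how the breaks argument is interpreted (so 8 is a suggestion, not a guaranteed number of intervals).

```r
# Hypothetical ages and outcomes
age      <- c(2, 8, 15, 18, 22, 25, 31, 38, 44, 52, 61, 70)
survived <- factor(c("survive", "survive", "dead", "survive", "dead", "dead",
                     "survive", "dead", "dead", "survive", "dead", "dead"))

# "Pretty" cut points as hist() would pick them for a suggested 8 intervals
brk <- hist(age, breaks = 8, plot = FALSE)$breaks

# Assign each age to its interval
age_bin <- cut(age, breaks = brk, include.lowest = TRUE)

# Counts per interval and outcome; the row totals set the bar widths
tab <- table(age_bin, survived)
tab

# Share of each outcome within an interval; these set the bar splits
prop.table(tab, margin = 1)
```

Intervals that contain no observations show up as NaN rows in the proportions, which is a reminder that sparse age ranges deserve caution in the chart as well.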

This just scratches the surface (haven’t used ggplot yet). I have tried generalized pair plots using GGally, but haven’t found them that insightful once you get over 4 or 5 variables. I would appreciate any further recommendations and tips/tricks on visualization with R.

T.D. Meister
