
How do I re-arrange??: Ordering a plot revisited

[This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Back in October of last year I wrote a blog post about reordering/rearranging plots. This was, and continues to be, a frequent question on listservs and R help sites. In light of my recent studies/presenting on The Mechanics of Data Visualization, based on the work of Stephen Few (2012, 2009), I realized I was remiss in not explaining how to order variables from largest to smallest bar (particularly in Cleveland dot plots and bar plots). It is often much more meaningful to arrange (order) factor levels by the size of one or more other numeric variables. This allows for easier pattern recognition than the standard alphabetic arrangement of levels.

This post will take you through a demonstration of sorting bars/points on another variable; however, it assumes you already know that to reorder/rearrange a plot you must reorder the factor levels (if you do not know this, see this blog post). We then explore my GitHub package plotflow to add efficiency to re-leveling in the workflow. After we learn how to sort by bar/point size we will look at an applied use. I will use ggplot2 because it is my go-to plotting system; however, these methods work with the base and lattice plotting systems as well.
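As a quick refresher on that assumption, here is a minimal sketch of how factor levels, not row order, determine the arrangement. The toy data frame and its column names are made up for illustration and are not part of the original post.

```r
## Toy data (hypothetical): three groups with one value each
toy <- data.frame(group = c("b", "a", "c"), value = c(3, 1, 2))

## Default: levels are sorted alphabetically
f_default <- factor(toy$group)
levels(f_default)
## [1] "a" "b" "c"

## Explicit: levels follow the ascending order of `value`
f_ordered <- factor(toy$group, levels = toy$group[order(toy$value)])
levels(f_ordered)
## [1] "a" "c" "b"
```

Plotting functions lay out the categories in level order, which is why the rest of this post is about changing levels.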

Click here for a .R file of the complete code found below.


Section 1: Reordering by Bar/Point Size

Create a data set we can alter

mtcars3 <-mtcars2 <-data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
mtcars3$cyl  <-mtcars2$cyl <-as.factor(mtcars2$cyl)
head(mtcars2)

##                 car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

An Example of Unordered Bars/Points

In this example it's difficult to find trends and patterns in the data.

library(ggplot2)
library(gridExtra)
x <-ggplot(mtcars2, aes(y=car, x=mpg)) + 
    geom_point(stat="identity")

y <-ggplot(mtcars2, aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip()

grid.arrange(x, y, ncol=2)

An Example of Ordered Bars/Points

Below we use the levels argument to factor in conjunction with order to order the levels of car by miles per gallon (mpg).

## Re-level the cars by mpg
mtcars3$car <-factor(mtcars2$car, levels=mtcars2[order(mtcars$mpg), "car"])

x <-ggplot(mtcars3, aes(y=car, x=mpg)) + 
    geom_point(stat="identity")

y <-ggplot(mtcars3, aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip()

grid.arrange(x, y, ncol=2)

This is an example in which each factor level has a unique row. This is not always the case. For instance, if we want to use mtcars2$cyl rather than mtcars2$car as the factor we'd have multiple observations for each cylinder level. In these instances we'd most likely use aggregate to summarize by a variable, as seen in the ordering of mtcars2$carb by average mpg below.

An Example of Ordering by a Summary Statistic

## Re-level the carb by average mpg
(ag_mtcars <-aggregate(mpg ~ carb, mtcars3, mean))

##   carb   mpg
## 1    1 25.34
## 2    2 22.40
## 3    3 16.30
## 4    4 15.79
## 5    6 19.70
## 6    8 15.00

mtcars3$carb <-factor(mtcars2$carb, levels=ag_mtcars[order(ag_mtcars$mpg), "carb"])

ggplot(mtcars3, aes(y=carb, x=mpg)) + 
    geom_point(stat="identity", size=2, aes(color=carb))

An Example of Ordered and Faceted Bars/Points

The last plot in this section adds faceting to further draw distinction and allow for pattern recognition. The ordering of the facets can also be changed by reordering factor levels in a way that is sensible for representing the narrative the data is telling.

ggplot(mtcars3, aes(y=car, x=mpg)) + 
    geom_point(stat="identity") +
    facet_grid(cyl~., scales = "free", space="free")

Recapping Section 1: Reordering by Bar/Point Size

In this first section we learned:

  1. Ordering factors by a numeric variable increases the ability to recognize patterns
  2. We can have (a) one row per factor level or (b) multiple rows per factor level.
    • The first scenario requires using order to reorder the factor column of the dataframe and feeding the result to the levels argument of factor.
    • The second scenario requires some sort of aggregation by a summary statistic before using order and feeding the result to the levels argument of factor.
  3. Adding faceting can increase the ability to further find patterns among the ordered figure.
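As an aside, base R also ships reorder (in the stats package), which covers both scenarios in one call. This is a different route from the levels/order approach used in this section, shown here only as a sketch for comparison; the object names are illustrative.

```r
## Scenario (b): multiple rows per level, summarized with FUN
carb_by_mpg <- reorder(factor(mtcars$carb), mtcars$mpg, FUN = mean)
levels(carb_by_mpg)
## [1] "8" "4" "3" "6" "2" "1"

## Same levels as the aggregate + order approach
ag <- aggregate(mpg ~ carb, mtcars, mean)
manual <- factor(mtcars$carb, levels = ag[order(ag$mpg), "carb"])
identical(levels(manual), levels(carb_by_mpg))
## [1] TRUE

## Scenario (a): one row per level
cars <- data.frame(car = rownames(mtcars), mpg = mtcars$mpg)
car_by_mpg <- reorder(factor(cars$car), cars$mpg)
tail(levels(car_by_mpg), 1)
## [1] "Toyota Corolla"
```

reorder sorts the levels in ascending order of the summary; negate the numeric variable to reverse the order.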

Section 2: Speeding Up the Workflow With the plotflow Package

Because I frequently need to reorder factors by other numeric variables, and using order and sometimes aggregate is tedious and annoying, I have wrapped this process up as a function called order_by in the plotflow package. I pretty much ripped off the entire function from Thomas Wutzler. This function allows the user to sort a dataframe by one or more numeric variables and return the new dataframe with a re-leveled factor. This is useful in that a new dataframe is created rather than tampering with the original. The function also allows a summary statistic to be passed via the FUN argument in a similar fashion to aggregate. This approach saves typing and is more intuitive.
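To make the idea concrete, here is a rough base R sketch of what a wrapper of this kind might do. The helper relevel_by, its arguments, and its single-variable behavior are hypothetical and are not plotflow's actual implementation of order_by.

```r
## Hypothetical helper; NOT plotflow's implementation of order_by
relevel_by <- function(data, fact, by, FUN = mean) {
    ## Summarize `by` within each level of `fact` (one value per level)
    scores <- tapply(data[[by]], data[[fact]], FUN)
    ## Only the local copy is modified; the caller's dataframe is untouched
    data[[fact]] <- factor(data[[fact]], levels = names(sort(scores)))
    data
}

out <- relevel_by(mtcars, "carb", "mpg")
levels(out$carb)
## [1] "8" "4" "3" "6" "2" "1"

is.factor(mtcars$carb)  ## the original column is still numeric
## [1] FALSE
```

Returning a modified copy is what lets a call like this sit directly inside a plotting call without altering the data used elsewhere.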

Getting the plotflow package

To get plotflow you can install the devtools package and use the install_github function:

# install.packages("devtools")

library(devtools)
install_github("trinker/plotflow")

What Does order_by do?

library(plotflow)
dat <-aggregate(cbind(mpg, hp, disp)~carb, mtcars, mean)
dat$carb <-factor(dat$carb)

## compare levels (data set looks the same though)
dat$carb

## [1] 1 2 3 4 6 8
## Levels: 1 2 3 4 6 8

order_by(carb, ~-hp + -mpg, data = dat)$carb

## [1] 1 2 3 4 6 8
## Levels: 8 4 3 6 2 1

By default order_by returns a dataframe; however, we can also tell order_by to return a vector by setting df=FALSE.

## Return just the vector with new levels
order_by(carb, ~ -hp + -mpg, dat, df=FALSE)

## [1] 1 2 3 4 6 8
## Levels: 8 4 3 6 2 1

Let's see order_by in action.

Use order_by to Order Bars

library(ggplot2)

## Reset the data from Section 1
dat2 <-data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
ggplot(order_by(car, ~ mpg, dat2), aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip() + ggtitle("Order Pretty Easy")

Aggregated by Summary Stat

Carb Ordered By Summary (Mean) of mpg

## Ordered points with the order_by function
a <-ggplot(order_by(carb, ~ mpg, dat2, mean), aes(x=carb, y=mpg)) +
    geom_point(stat="identity", aes(colour=carb)) +
    coord_flip() + ggtitle("Ordered Dot Plots Made Easy")

## Reverse the ordered points
b <-ggplot(order_by(carb, ~ -mpg, dat2, mean), aes(x=carb, y=mpg)) +
    geom_point(stat="identity", aes(colour=carb)) +
    coord_flip() + ggtitle("Reverse Order Too!")

grid.arrange(a, b, ncol=1)

Nested Usage (order_by on an order by dataframe)

ggplot(order_by(gear, ~mpg, dat2, mean), aes(mpg, carb)) +
    geom_point(aes(color=factor(cyl))) +
    facet_grid(gear~., scales="free") + ggtitle("I'm Nested (Yay for me!)")

The order_by function makes life a little easier.


Section 3: Using order_by on Real Data

Now I turn the attention to a real life usage of ordering a factor by a numeric variable in order to see patterns. A while back Abraham Mathew presented a blog post utilizing some interesting data on job satisfaction within bigger technology companies. His demonstrations showed various ways to utilize ggplot2 to visualize the data.

As I read the post I was also reading a bit of Stephen Few's work, which recommends ordering bars/dotplots to better see patterns. This visualization, which Mathew produced with ggplot2, is captivating:

However, I believed that ordering the bars as Few (2012, 2009) suggests might enhance our ability to see a pattern: which of the four variables are linked?

In this next section we'll grab the data, clean it, reshape it, re-level the factors and plot in a more meaningful way to reveal patterns not seen before. Let's begin by loading the following packages:

library(RCurl)
library(XML)
library(rjson)
library(ggplot2)
library(qdap)
library(reshape2)
library(gridExtra)

Now we can scrape the data and extract the required pieces.

URL <-"http://www.payscale.com/top-tech-employers-compared-2012/job-satisfaction-survey-data"
doc   <-htmlTreeParse(URL, useInternalNodes=TRUE)
nodes <-getNodeSet(doc, "//script[@type='text/javascript']")[[19]][[1]]
dat <-gsub("];", "]", capture.output(nodes)[5:27])
ndat <-data.frame(do.call(rbind, fromJSON(paste(dat, collapse = ""))))[, -2]
ndat[, 1:5] <-lapply(ndat, unlist)
IBM <-grepl("International Business Machines", ndat[, 1])
ndat[IBM, 1] <-bracketXtract(ndat[IBM, 1])
ndat[, 1] <-sapply(strsplit(ndat[, 1], "\\s|,"), "[", 1)

At this point we re-level the factor Employer.Name by job satisfaction.

## Re-level with order_by
ndat[, "Employer.Name"] <-order_by(Employer.Name, ~Job.Satisfaction, ndat, df=FALSE)
colnames(ndat)[1] <-"Employer"
ndat

##           Employer Job.Satisfaction Work.Stress Job.Meaning Job.Flexibility
## 1            Adobe           0.6875      0.7031      0.4532          0.8594
## 2       Amazon.com           0.7723      0.7010      0.4901          0.7376
## 3              AOL           0.7714      0.6572      0.4118          0.7714
## 4            Apple           0.7800      0.6510      0.7114          0.7567
## 5             Dell           0.6890      0.6275      0.4983          0.8712
## 6             eBay           0.7097      0.6087      0.5824          0.8153
## 7         Facebook           0.8750      0.6875      0.8125          0.9375
## 8           Google           0.7987      0.5660      0.6387          0.8334
## 9  Hewlett-Packard           0.5807      0.6034      0.4335          0.8733
## 10           Intel           0.7339      0.6677      0.6892          0.8896
## 11             IBM           0.6414      0.6637      0.4631          0.8946
## 12        LinkedIn           1.0000      0.6923      0.8462          0.9166
## 13       Microsoft           0.6777      0.6181      0.6099          0.9281
## 14     Monster.com           0.7273      0.8181      0.5454          0.8181
## 15           Nokia           0.7400      0.4800      0.5600          0.8200
## 16          Nvidia           0.7692      0.5897      0.5385          0.7692
## 17          Oracle           0.6713      0.6406      0.4221          0.9218
## 18  Salesforce.com           0.8667      0.7334      0.6667          0.8275
## 19         Samsung           0.6596      0.7447      0.6595          0.6170
## 20            Sony           0.7500      0.6667      0.5217          0.8750
## 21          Yahoo!           0.6762      0.5333      0.5145          0.8750

Now we can reshape the data to long format which ggplot2 prefers almost exclusively.

## Melt the data to long format
mdat <-melt(ndat)
mdat[, 2] <-factor(gsub("\\.", " ", mdat[, 2]), 
    levels = gsub("\\.", " ", colnames(ndat)[-1]))

head(mdat)

##     Employer         variable  value
## 1      Adobe Job Satisfaction 0.6875
## 2 Amazon.com Job Satisfaction 0.7723
## 3        AOL Job Satisfaction 0.7714
## 4      Apple Job Satisfaction 0.7800
## 5       Dell Job Satisfaction 0.6890
## 6       eBay Job Satisfaction 0.7097

Now our data is cleaned and reshaped with Employer re-leveled by job satisfaction. I chose this (job satisfaction) as the variable of interest because of literature I've read around job performance, teacher retention and job satisfaction. Let's see if re-leveling the factor improves the trends and patterns we can see.

ggplot(data=mdat, aes(x=Employer, y=value, fill=factor(Employer))) + 
  geom_bar(stat="identity") + coord_flip() + ylim(c(0, 1)) + 
  facet_wrap( ~ variable, ncol=2) + theme(legend.position="none") + 
  ggtitle("Plot 3: Employee Job Satisfaction at Top Tech Companies") +
  ylab(c("Job Satisfaction"))

The first thing I noticed after the reordering is that Job Meaning and Job Satisfaction appear to be related. In general, higher satisfaction corresponds with greater meaning. I also noticed that Flexibility and Stress do not appear to correspond with satisfaction. This made me curious and so I ran a simple regression model with Satisfaction as the outcome and the other three variables as predictors. The story from the regression model is similar to the visualization.

mod <-lm(Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, data=ndat)

anova(mod)

## Analysis of Variance Table
## 
## Response: Job.Satisfaction
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## Work.Stress      1 0.0069  0.0069    1.45 0.2452    
## Job.Meaning      1 0.0816  0.0816   17.04 0.0007 ***
## Job.Flexibility  1 0.0006  0.0006    0.13 0.7260    
## Residuals       17 0.0814  0.0048                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(mod)

## 
## Call:
## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, 
##     data = ndat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12043 -0.03002 -0.00263  0.03268  0.11915 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.3101     0.2413    1.29   0.2160    
## Work.Stress       0.1062     0.2147    0.49   0.6273    
## Job.Meaning       0.5241     0.1288    4.07   0.0008 ***
## Job.Flexibility   0.0733     0.2058    0.36   0.7260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0692 on 17 degrees of freedom
## Multiple R-squared:  0.523,  Adjusted R-squared:  0.438 
## F-statistic: 6.21 on 3 and 17 DF,  p-value: 0.00483
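As a quick check on the output above, the multiple R-squared can be recovered by hand from the sums of squares in the ANOVA table: the explained share is the sum of the term sums of squares divided by the total. The sketch below hard-codes the rounded values printed above, so it only agrees with the reported 0.523 to about two decimal places.

```r
## Sums of squares copied from the printed ANOVA table (rounded values)
ss_terms <- c(Work.Stress = 0.0069, Job.Meaning = 0.0816, Job.Flexibility = 0.0006)
ss_resid <- 0.0814

## R-squared = explained sum of squares / total sum of squares
r_squared <- sum(ss_terms) / (sum(ss_terms) + ss_resid)
round(r_squared, 2)
## [1] 0.52

## Share of the total attributable to Job.Meaning alone (sequential SS)
round(ss_terms[["Job.Meaning"]] / (sum(ss_terms) + ss_resid), 2)
## [1] 0.48
```

Nearly all of the explained variability comes from the Job.Meaning term, which matches what the reordered bars suggested.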

The model accounts for ~50% of the variability in Job Satisfaction. While the model is significant, there clearly is more than just Meaning that impacts Satisfaction. I decided to do a bit more plotting and use the preattentive attributes of color and size to represent Flexibility and Stress in the visual model.

theplot <-ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) + 
    geom_smooth(method="lm", fill = "blue", alpha = .1, size=1) +  
    geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
    xlim(c(.4, .9)) +
    geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
    geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
    scale_colour_gradient(low="gold", high="red") 

theplot

There is certainly a pull on the smoothed line by the group of tech companies circled below, which may reflect a variable unaccounted for in the model.

library(grid)  ## circleGrob() and unit() come from the grid package
theplot + annotation_custom(grob=circleGrob(r = unit(.4,"npc")), xmin=.47, xmax=.57, ymin=.72, ymax=.82)

If we view the data as two separate smoothed regression lines we get a more predictable model. This indicates a variable that we have not included.

ndat$outs <-1
ndat$outs[ndat$Employer %in% qcv(AOL, Amazon.com, Nvidia, Sony)] <-0

ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) + 
    geom_smooth(method="lm", fill = "blue", alpha = .1, size=1, aes(group=outs)) +  
    geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
    xlim(c(.4, .9)) +
    geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
    geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
    scale_colour_gradient(low="gold", high="red") 


We've learned:

  1. Re-leveling/re-ordering a factor by a numeric variable(s) can lead to important pattern detection in data.
  2. The levels argument to factor is key to the reordering.
  3. order and sometimes aggregate allow the re-leveling to occur.
  4. The order_by function in the plotflow package can make re-leveling easier.
  5. Faceting can amplify the distinction made by the re-leveling.

*Created using the reports (Rinker, 2013) package


References

