Back in October of last year I wrote a blog post about reordering/rearranging plots. This was, and continues to be, a frequent question on listservs and R help sites. In light of my recent studies/presenting on The Mechanics of Data Visualization, based on the work of Stephen Few (2012, 2009), I realized I was remiss in not explaining how to order variables from largest to smallest bar (particularly in Cleveland dot plots and bar plots). It is often much more meaningful to arrange (order) factor levels by the size of one or more other numeric variables. This allows for easier pattern recognition than the standard alphabetic arrangement of levels.
This post will take you through a demonstration of sorting bars/points on another variable. It assumes you already know that to reorder/rearrange the elements of a plot you must reorder the factor levels (if you do not know this, see this blog post). We then explore my GitHub package plotflow to add efficiency to re-leveling in the workflow. After we learn how to sort by bar/point size we will look at an applied use. I will use ggplot2 because it is my go-to plotting system; however, these methods work with the base and lattice plotting systems as well.
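As a minimal sketch of that prerequisite, the base R snippet below shows that factor levels default to alphabetical order and that a plot follows the level order rather than the row order. The data frame `toy` and its values are made up purely for illustration.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(group = c("b", "c", "a"), value = c(3, 1, 2))

# By default the levels are sorted alphabetically
toy$group <- factor(toy$group)
levels(toy$group)
## [1] "a" "b" "c"

# Re-level so the levels follow the size of `value` (smallest to largest)
toy$group <- factor(toy$group,
                    levels = as.character(toy$group)[order(toy$value)])
levels(toy$group)
## [1] "c" "a" "b"

# Base graphics follow the level order once we summarize by the factor
barplot(tapply(toy$value, toy$group, sum), horiz = TRUE)
```

Only the levels change here; the rows of `toy` stay where they were.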
Click here for a .R file of the complete code found below.
Section 1: Reordering by Bar/Point Size
Create a data set we can alter
    mtcars3 <- mtcars2 <- data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
    mtcars3$cyl <- mtcars2$cyl <- as.factor(mtcars2$cyl)
    head(mtcars2)
    ##                 car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
An Example of Unordered Bars/Points
In this example it's difficult to find trends and patterns in the data.
    library(ggplot2)
    library(gridExtra)
    x <- ggplot(mtcars2, aes(y=car, x=mpg)) +
        geom_point(stat="identity")
    y <- ggplot(mtcars2, aes(x=car, y=mpg)) +
        geom_bar(stat="identity") +
        coord_flip()
    grid.arrange(x, y, ncol=2)
An Example of Ordered Bars/Points
Below we use the levels argument to factor in conjunction with order to order the levels of car by miles per gallon (mpg).
    ## Re-level the cars by mpg
    mtcars3$car <- factor(mtcars2$car, levels=mtcars2[order(mtcars2$mpg), "car"])
    x <- ggplot(mtcars3, aes(y=car, x=mpg)) +
        geom_point(stat="identity")
    y <- ggplot(mtcars3, aes(x=car, y=mpg)) +
        geom_bar(stat="identity") +
        coord_flip()
    grid.arrange(x, y, ncol=2)
In this example each factor level occupies exactly one row. This is not always the case. For instance, if we use mtcars2$cyl rather than mtcars2$car as the factor, we have multiple observations for each cylinder level. In these instances we would most likely use aggregate to summarize by a variable, as seen in the ordering of mtcars2$carb by average mpg below.
An Example of Points Ordered by a Summary Statistic
    ## Re-level the carb by average mpg
    (ag_mtcars <- aggregate(mpg ~ carb, mtcars3, mean))
    ##   carb   mpg
    ## 1    1 25.34
    ## 2    2 22.40
    ## 3    3 16.30
    ## 4    4 15.79
    ## 5    6 19.70
    ## 6    8 15.00
    mtcars3$carb <- factor(mtcars2$carb, levels=ag_mtcars[order(ag_mtcars$mpg), "carb"])
    ggplot(mtcars3, aes(y=carb, x=mpg)) +
        geom_point(stat="identity", size=2, aes(color=carb))
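For comparison, the same re-leveling can be sketched with base R's reorder function, which re-levels a factor by a summary of a second variable in one step. This is an alternative to the aggregate approach above rather than the method used in the rest of the post; the object name `mt` is just a working copy introduced for this sketch.

```r
# Work on a copy so the built-in mtcars data set is left untouched
mt <- mtcars

# reorder() re-levels a factor by a summary (here the mean) of another variable
mt$carb <- reorder(factor(mt$carb), mt$mpg, FUN = mean)

# Levels now run from lowest to highest average mpg
levels(mt$carb)
## [1] "8" "4" "3" "6" "2" "1"
```

The resulting level order matches the aggregate table: 8 carburetors has the lowest mean mpg (15.00) and 1 carburetor the highest (25.34).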
An Example of Ordered and Faceted Bars/Points
The last plot in this section adds faceting to further draw distinction and allow for pattern recognition. The ordering of the facets can also be changed by reordering factor levels in a way that is sensible for representing the narrative the data is telling.
    ggplot(mtcars3, aes(y=car, x=mpg)) +
        geom_point(stat="identity") +
        facet_grid(cyl~., scales = "free", space="free")
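To illustrate the point about facet order, here is a small sketch that puts the 8-cylinder panel first by re-leveling cyl. The choice of 8, 6, 4 is arbitrary and only for demonstration; the data frame `d` is rebuilt here so the snippet runs on its own, and the plot is drawn only if ggplot2 is installed.

```r
# Rebuild the Section 1 data so this sketch is self-contained
d <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)
d$car <- factor(d$car, levels = as.character(d$car)[order(d$mpg)])

# Facets are drawn in level order, so list the levels in the order we want
d$cyl <- factor(d$cyl, levels = c(8, 6, 4))
levels(d$cyl)
## [1] "8" "6" "4"

# Draw the faceted dot plot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(y = car, x = mpg)) +
    geom_point() +
    facet_grid(cyl ~ ., scales = "free", space = "free")
  print(p)
}
```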
Recapping Section 1: Reordering by Bar/Point Size
In this first section we learned:
- Ordering factors by a numeric variable increases the ability to recognize patterns
- We can have (a) one row per factor level or (b) multiple rows per factor level.
- Adding faceting can increase the ability to further find patterns among the ordered figure.
Section 2: Speeding Up the Workflow With the plotflow Package
Because I frequently need to reorder factors by other numeric variables, and using order and sometimes aggregate is tedious and annoying, I have wrapped this process up as a function called order_by in the plotflow package. I pretty much ripped off the entire function from Thomas Wutzler. This function allows the user to sort a dataframe by one or more numeric variables and returns the new dataframe with a re-leveled factor. This is useful in that a new dataframe is created rather than tampering with the original. The function also allows a summary stat to be passed via the FUN argument in a similar fashion to aggregate. This approach saves typing and is more intuitive.
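To make the idea concrete, here is a rough base R sketch of what such a wrapper might look like. The function relevel_by and its arguments are hypothetical names invented for this illustration; it is not plotflow's order_by, it does not use a formula interface, and it handles only a single numeric variable.

```r
# relevel_by() is a hypothetical helper written for this sketch;
# it is NOT plotflow's order_by() and supports only one numeric variable.
relevel_by <- function(data, fact, by, FUN = mean, decreasing = FALSE) {
  # Summarize the numeric variable within each level of the factor
  scores <- tapply(data[[by]], data[[fact]], FUN)
  # Rebuild the factor with its levels sorted by that summary
  new_levels <- names(scores)[order(scores, decreasing = decreasing)]
  data[[fact]] <- factor(data[[fact]], levels = new_levels)
  # Return the modified copy; the caller's data frame is unchanged
  data
}

new_mtcars <- relevel_by(mtcars, "carb", "mpg")
levels(new_mtcars$carb)
## [1] "8" "4" "3" "6" "2" "1"

# The original data set has not been altered
is.factor(mtcars$carb)
## [1] FALSE
```

Returning a copy rather than modifying in place mirrors the design choice described above: the re-leveled data can be handed straight to a plotting call while the original stays as it was.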
Getting the plotflow package
To get plotflow you can install the devtools package and use the install_github function:
    # install.packages("devtools")
    library(devtools)
    install_github("trinker/plotflow")
What Does order_by do?
    library(plotflow)
    dat <- aggregate(cbind(mpg, hp, disp) ~ carb, mtcars, mean)
    dat$carb <- factor(dat$carb)
    ## compare levels (data set looks the same though)
    dat$carb
    ## [1] 1 2 3 4 6 8
    ## Levels: 1 2 3 4 6 8
    order_by(carb, ~-hp + -mpg, data = dat)$carb
    ## [1] 1 2 3 4 6 8
    ## Levels: 8 4 3 6 2 1
By default order_by returns a dataframe; however, we can also tell order_by to return a vector by setting df=FALSE.
    ## Return just the vector with new levels
    order_by(carb, ~ -hp + -mpg, dat, df=FALSE)
    ## [1] 1 2 3 4 6 8
    ## Levels: 8 4 3 6 2 1
Let's see order_by in action.
Use order_by to Order Bars
    library(ggplot2)
    ## Reset the data from Section 1
    dat2 <- data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
    ggplot(order_by(car, ~ mpg, dat2), aes(x=car, y=mpg)) +
        geom_bar(stat="identity") +
        coord_flip() +
        ggtitle("Order Pretty Easy")
Aggregated by Summary Stat
Carb Ordered By Summary (Mean) of mpg
    ## Ordered points with the order_by function
    a <- ggplot(order_by(carb, ~ mpg, dat2, mean), aes(x=carb, y=mpg)) +
        geom_point(stat="identity", aes(colour=carb)) +
        coord_flip() +
        ggtitle("Ordered Dot Plots Made Easy")
    ## Reverse the ordered points
    b <- ggplot(order_by(carb, ~ -mpg, dat2, mean), aes(x=carb, y=mpg)) +
        geom_point(stat="identity", aes(colour=carb)) +
        coord_flip() +
        ggtitle("Reverse Order Too!")
    grid.arrange(a, b, ncol=1)
Nested Usage (order_by on an order by dataframe)
    ggplot(order_by(gear, ~mpg, dat2, mean), aes(mpg, carb)) +
        geom_point(aes(color=factor(cyl))) +
        facet_grid(gear~., scales="free") +
        ggtitle("I'm Nested (Yay for me!)")
The order_by function makes life a little easier.
Section 3: Using order_by on Real Data
Now I turn the attention to a real life usage of ordering a factor by a numeric variable in order to see patterns. A while back Abraham Mathew presented a blog post utilizing some interesting data on job satisfaction within bigger technology companies. His demonstrations showed various ways to utilize ggplot2 to visualize the data.
As I read the post I was also reading a bit of Stephen Few's work, which recommends ordering bars/dotplots to better see patterns. This visualization, which Mathew produced with ggplot2, is captivating:
However, I believed that ordering the bars as Few (2012, 2009) suggests may enhance our ability to see a pattern: which of the four variables are linked?
In this next section we'll grab the data, clean it, reshape it, re-level the factors and plot in a more meaningful way to reveal patterns not seen before. Let's begin by loading the following packages:
    library(RCurl)
    library(XML)
    library(rjson)
    library(ggplot2)
    library(qdap)
    library(reshape2)
    library(gridExtra)
Now we can scrape the data and extract the required pieces.
    URL <- "http://www.payscale.com/top-tech-employers-compared-2012/job-satisfaction-survey-data"
    doc <- htmlTreeParse(URL, useInternalNodes=TRUE)
    nodes <- getNodeSet(doc, "//script[@type='text/javascript']")[[19]][[1]]
    dat <- gsub("];", "]", capture.output(nodes)[5:27])
    ndat <- data.frame(do.call(rbind, fromJSON(paste(dat, collapse = ""))))[, -2]
    ndat[, 1:5] <- lapply(ndat, unlist)
    IBM <- grepl("International Business Machines", ndat[, 1])
    ndat[IBM, 1] <- bracketXtract(ndat[IBM, 1])
    ndat[, 1] <- sapply(strsplit(ndat[, 1], "\\s|,"), "[", 1)
At this point we re-level the factor Employer.Name by job satisfaction.
    ## Re-level with order_by
    ndat[, "Employer.Name"] <- order_by(Employer.Name, ~Job.Satisfaction, ndat, df=FALSE)
    colnames(ndat)[1] <- "Employer"
    ndat
    ##           Employer Job.Satisfaction Work.Stress Job.Meaning Job.Flexibility
    ## 1            Adobe           0.6875      0.7031      0.4532          0.8594
    ## 2       Amazon.com           0.7723      0.7010      0.4901          0.7376
    ## 3              AOL           0.7714      0.6572      0.4118          0.7714
    ## 4            Apple           0.7800      0.6510      0.7114          0.7567
    ## 5             Dell           0.6890      0.6275      0.4983          0.8712
    ## 6             eBay           0.7097      0.6087      0.5824          0.8153
    ## 7         Facebook           0.8750      0.6875      0.8125          0.9375
    ## 8           Google           0.7987      0.5660      0.6387          0.8334
    ## 9  Hewlett-Packard           0.5807      0.6034      0.4335          0.8733
    ## 10           Intel           0.7339      0.6677      0.6892          0.8896
    ## 11             IBM           0.6414      0.6637      0.4631          0.8946
    ## 12        LinkedIn           1.0000      0.6923      0.8462          0.9166
    ## 13       Microsoft           0.6777      0.6181      0.6099          0.9281
    ## 14     Monster.com           0.7273      0.8181      0.5454          0.8181
    ## 15           Nokia           0.7400      0.4800      0.5600          0.8200
    ## 16          Nvidia           0.7692      0.5897      0.5385          0.7692
    ## 17          Oracle           0.6713      0.6406      0.4221          0.9218
    ## 18  Salesforce.com           0.8667      0.7334      0.6667          0.8275
    ## 19         Samsung           0.6596      0.7447      0.6595          0.6170
    ## 20            Sony           0.7500      0.6667      0.5217          0.8750
    ## 21          Yahoo!           0.6762      0.5333      0.5145          0.8750
Now we can reshape the data to long format which ggplot2 prefers almost exclusively.
    ## Melt the data to long format
    mdat <- melt(ndat)
    mdat[, 2] <- factor(gsub("\\.", " ", mdat[, 2]),
        levels = gsub("\\.", " ", colnames(ndat)[-1]))
    head(mdat)
    ##     Employer         variable  value
    ## 1      Adobe Job Satisfaction 0.6875
    ## 2 Amazon.com Job Satisfaction 0.7723
    ## 3        AOL Job Satisfaction 0.7714
    ## 4      Apple Job Satisfaction 0.7800
    ## 5       Dell Job Satisfaction 0.6890
    ## 6       eBay Job Satisfaction 0.7097
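If reshape2 is not at hand, the wide-to-long step can be approximated in base R with stack. The sketch below uses only the first three employers and two measures from the table above so that it runs on its own; the object names `wide` and `long` are introduced just for this illustration.

```r
# A small wide-format excerpt of the data shown above
wide <- data.frame(
  Employer         = c("Adobe", "Amazon.com", "AOL"),
  Job.Satisfaction = c(0.6875, 0.7723, 0.7714),
  Work.Stress      = c(0.7031, 0.7010, 0.6572)
)

# stack() puts every measure column into one value column plus an indicator
long <- cbind(Employer = wide$Employer, stack(wide[, -1]))
names(long) <- c("Employer", "value", "variable")

# Replace the dots in the measure names, keeping the original column order
long$variable <- factor(gsub("\\.", " ", long$variable),
                        levels = gsub("\\.", " ", names(wide)[-1]))
long
```

Each employer now appears once per measure, which is the one-row-per-observation layout that facet_wrap( ~ variable) relies on.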
Now our data is cleaned and reshaped with Employer re-leveled by job satisfaction. I chose job satisfaction as the variable of interest because of literature I've read around job performance, teacher retention and job satisfaction. Let's see if re-leveling the factor improves the trends and patterns we can see.
    ggplot(data=mdat, aes(x=Employer, y=value, fill=factor(Employer))) +
        geom_bar(stat="identity") +
        coord_flip() +
        ylim(c(0, 1)) +
        facet_wrap( ~ variable, ncol=2) +
        theme(legend.position="none") +
        ggtitle("Plot 3: Employee Job Satisfaction at Top Tech Companies") +
        ylab(c("Job Satisfaction"))
The first thing I noticed after the reordering is that Job Meaning and Job Satisfaction appear to be related. In general, higher satisfaction corresponds with greater meaning. I also noticed that Flexibility and Stress do not appear to correspond with satisfaction. This made me curious and so I ran a simple regression model with Satisfaction as the outcome and the other three variables as predictors. The story from the regression model is similar to the visualization.
    mod <- lm(Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, data=ndat)
    anova(mod)
    ## Analysis of Variance Table
    ##
    ## Response: Job.Satisfaction
    ##                 Df Sum Sq Mean Sq F value Pr(>F)
    ## Work.Stress      1 0.0069  0.0069    1.45 0.2452
    ## Job.Meaning      1 0.0816  0.0816   17.04 0.0007 ***
    ## Job.Flexibility  1 0.0006  0.0006    0.13 0.7260
    ## Residuals       17 0.0814  0.0048
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    summary(mod)
    ##
    ## Call:
    ## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility,
    ##     data = ndat)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -0.12043 -0.03002 -0.00263  0.03268  0.11915
    ##
    ## Coefficients:
    ##                 Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)       0.3101     0.2413    1.29   0.2160
    ## Work.Stress       0.1062     0.2147    0.49   0.6273
    ## Job.Meaning       0.5241     0.1288    4.07   0.0008 ***
    ## Job.Flexibility   0.0733     0.2058    0.36   0.7260
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.0692 on 17 degrees of freedom
    ## Multiple R-squared:  0.523,  Adjusted R-squared:  0.438
    ## F-statistic: 6.21 on 3 and 17 DF,  p-value: 0.00483
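As a quick numeric companion to the visual impression that Meaning and Satisfaction move together, a plain Pearson correlation between the two columns can be computed. This is only a sketch: the two vectors below are typed in from the ndat table printed earlier, so the snippet does not depend on the scrape, and the vector names are introduced just for this example.

```r
# Values copied from the ndat table printed earlier in this post
job_satisfaction <- c(0.6875, 0.7723, 0.7714, 0.7800, 0.6890, 0.7097, 0.8750,
                      0.7987, 0.5807, 0.7339, 0.6414, 1.0000, 0.6777, 0.7273,
                      0.7400, 0.7692, 0.6713, 0.8667, 0.6596, 0.7500, 0.6762)
job_meaning      <- c(0.4532, 0.4901, 0.4118, 0.7114, 0.4983, 0.5824, 0.8125,
                      0.6387, 0.4335, 0.6892, 0.4631, 0.8462, 0.6099, 0.5454,
                      0.5600, 0.5385, 0.4221, 0.6667, 0.6595, 0.5217, 0.5145)

# Pearson correlation between the two measures
r <- cor(job_satisfaction, job_meaning)
r

# Share of variance in Satisfaction associated with Meaning on its own
r^2
```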
The model accounts for ~50% of the variability in Job Satisfaction. While the model is significant, there clearly is more than just Meaning that impacts Satisfaction. I decided to do a bit more plotting and use the preattentive attributes of color and size to represent Flexibility and Stress in the visual model.
    theplot <- ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) +
        geom_smooth(method="lm", fill = "blue", alpha = .1, size=1) +
        geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
        xlim(c(.4, .9)) +
        geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
        geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
        scale_colour_gradient(low="gold", high="red")
    theplot
There is certainly a pull on the smoothed line by this group of tech companies, circled below, that may be an unaccounted variable in the model.
    library(grid)  # provides circleGrob and unit
    theplot +
        annotation_custom(grob=circleGrob(r = unit(.4,"npc")),
            xmin=.47, xmax=.57, ymin=.72, ymax=.82)
If we view the data as two separate smoothed regression lines we get a more predictable model. This indicates a variable that we have not included.
    ndat$outs <- 1
    ndat$outs[ndat$Employer %in% qcv(AOL, Amazon.com, Nvidia, Sony)] <- 0
    ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) +
        geom_smooth(method="lm", fill = "blue", alpha = .1, size=1, aes(group=outs)) +
        geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
        xlim(c(.4, .9)) +
        geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
        geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
        scale_colour_gradient(low="gold", high="red")
We've learned:
- Re-leveling/re-ordering a factor by a numeric variable(s) can lead to important pattern detection in data.
- The levels argument to factor is key to the reordering.
- order and sometimes aggregate allow the re-leveling to occur.
- The order_by function in the plotflow package can make re-leveling easier.
- Faceting can amplify the distinction made by the re-leveling.
*Created using the reports (Rinker, 2013) package
References
- Stephen Few, (2009) Now You See It: Simple Visualization Techniques for Quantitative Analysis.
- Stephen Few, (2012) Show me the numbers: Designing tables and graphs to enlighten.
- Tyler Rinker, (2013) reports: Package to assist in report writing. http://github.com/trinker/reports