Visualizing Survey Data : Comparison Between Observations
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Cybersecurity is a domain that really likes survey, or at the very least it has many folks within it that like to conduct and report on surveys. One recent survey on threat intelligence is in it’s second year, so it sets about comparing answers across years. Rather than go ingo the many technical/statistical issues with this survey, I’d like to focus on alternate ways to visualize the comparison across years.
We’ll use the data that makes up this chart (Figure 3 from the report):
since it’s pretty representative of the remainder of the figures.
Let’s start by reproducing this figure with ggplot2:
library(dplyr) library(tidyr) library(stringr) library(ggplot2) library(scales) library(ggthemes) library(extrafont) loadfonts(quiet=TRUE) read.csv("question.csv", stringsAsFactors=FALSE) %>% gather(year, value, -belief) %>% mutate(year=factor(sub("y", "", year)), belief=str_wrap(belief, 40)) -> question beliefs <- unique(question$belief) question$belief <- factor(beliefs, levels=rev(beliefs[c(1,2,4,5,3,7,6)])) gg <- ggplot(question, aes(belief, value, group=year)) gg <- gg + geom_bar(aes(fill=year), stat="identity", position="dodge", color="white", width=0.85) gg <- gg + geom_text(aes(label=percent(value)), hjust=-0.15, position=position_dodge(width=0.8), size=3) gg <- gg + scale_x_discrete(expand=c(0,0)) gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.8)) gg <- gg + scale_fill_tableau(name="") gg <- gg + coord_flip() gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participatingn") gg <- gg + theme_tufte(base_family="Arial Narrow") gg <- gg + theme(axis.ticks.x=element_blank()) gg <- gg + theme(axis.text.x=element_blank()) gg <- gg + theme(axis.ticks.y=element_blank()) gg <- gg + theme(legend.position="bottom") gg <- gg + theme(plot.title=element_text(hjust=0)) gg |
Now, the survey does caveat the findings and talks about non-response bias, sampling-frame bias and self-reporting bias. However, nowhere does it talk about the margin of error or anything relating to uncertainty. Thankfully, both the 2014 and 2015 reports communicate population and sample sizes, so we can figure out the margin of error:
library(samplesize4surveys) moe_2014 <- e4p(19915, 701, 0.5) ## With the parameters of this function: N = 19915 n = 701 P = 0.5 DEFF = 1 conf = 0.95 . ## The estimated coefficient of variation is 3.709879 . ## The margin of error is 3.635614 . ## moe_2015 <- e4p(18705, 692, 0.5) ## With the parameters of this function: N = 18705 n = 692 P = 0.5 DEFF = 1 conf = 0.95 . ## The estimated coefficient of variation is 3.730449 . ## The margin of error is 3.655773 . ## |
They are both roughly 3.65%
so let’s take a look at our dodged bar chart again with this new information:
mutate(question, ymin=value-0.0365, ymax=value+0.0365) -> question gg <- ggplot(question, aes(belief, value, group=year)) gg <- gg + geom_bar(aes(fill=year), stat="identity", position=position_dodge(0.85), color="white", width=0.85) gg <- gg + geom_linerange(aes(ymin=ymin, ymax=ymax), position=position_dodge(0.85), size=1.5, color="#bdbdbd") gg <- gg + scale_x_discrete(expand=c(0,0)) gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,0.85)) gg <- gg + scale_fill_tableau(name="") gg <- gg + coord_flip() gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participatingn") gg <- gg + theme_tufte(base_family="Arial Narrow") gg <- gg + theme(axis.ticks.x=element_blank()) gg <- gg + theme(axis.text.x=element_blank()) gg <- gg + theme(axis.ticks.y=element_blank()) gg <- gg + theme(legend.position="bottom") gg <- gg + theme(plot.title=element_text(hjust=0)) gg |
Hrm. There seems to be a bit of overlap. Let’s just focus on that:
gg <- ggplot(question, aes(belief, value, group=year)) gg <- gg + geom_pointrange(aes(ymin=ymin, ymax=ymax), position=position_dodge(0.25), size=1, color="#bdbdbd", fatten=1) gg <- gg + scale_x_discrete(expand=c(0,0)) gg <- gg + scale_y_continuous(expand=c(0,0), label=percent, limits=c(0,1)) gg <- gg + scale_fill_tableau(name="") gg <- gg + coord_flip() gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participatingn") gg <- gg + theme_tufte(base_family="Arial Narrow") gg <- gg + theme(axis.ticks.x=element_blank()) gg <- gg + theme(axis.text.x=element_blank()) gg <- gg + theme(axis.ticks.y=element_blank()) gg <- gg + theme(legend.position="bottom") gg <- gg + theme(plot.title=element_text(hjust=0)) gg |
The report actually makes hard claims based on the year-over-year change in the answers to many of the questions (not just this chart). Most have these overlapping intervals. Now, I understand that when a paying customer says they want a report that they wouldn’t really be satisfied with a one-pager saying “See last years’s report”, but not communicating the uncertainty in these results seems like a significant omission.
But, I digress. There are better (or at least alternate) ways than bars to show this comparison. One is a “dumbbell chart”.
question %>% group_by(belief) %>% mutate(line_col=ifelse(diff(value)<0, "2015", "2014"), hjust=ifelse(diff(value)<0, -0.5, 1.5)) %>% ungroup() -> question gg <- ggplot(question) gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col)) gg <- gg + geom_point(aes(x=value, y=belief, color=year)) gg <- gg + geom_text(data=filter(question, year=="2015"), aes(x=value, y=belief, label=percent(value), hjust=hjust), size=2.5) gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8)) gg <- gg + scale_color_tableau(name="") gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participatingn") gg <- gg + theme_tufte(base_family="Arial Narrow") gg <- gg + theme(axis.ticks.x=element_blank()) gg <- gg + theme(axis.text.x=element_blank()) gg <- gg + theme(axis.ticks.y=element_blank()) gg <- gg + theme(legend.position="bottom") gg <- gg + theme(plot.title=element_text(hjust=0)) gg |
I’ve used line color to indicate whether the 2015 value increased or decreased from 2014.
But, we still have the issue of communicating the margin of error. One way I came up with (which is not perfect) is to superimpose the dot-plot on top of the entire margin of error interval. While it doesn’t show the discrete start/end margin for each year it does help to show that making definitive statements on the value comparisons is not exactly a good idea:
group_by(question, belief) %>% summarize(xmin=min(ymin), xmax=max(ymax)) -> band gg <- ggplot(question) gg <- gg + geom_segment(data=band, aes(x=xmin, xend=xmax, y=belief, yend=belief), color="#bdbdbd", alpha=0.5, size=3) gg <- gg + geom_path(aes(x=value, y=belief, group=belief, color=line_col), show.legend=FALSE) gg <- gg + geom_point(aes(x=value, y=belief, color=year)) gg <- gg + geom_text(data=filter(question, year=="2015"), aes(x=value, y=belief, label=percent(value), hjust=hjust), size=2.5) gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(0,0.8)) gg <- gg + scale_color_tableau(name="") gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participatingn") gg <- gg + theme_tufte(base_family="Arial Narrow") gg <- gg + theme(axis.ticks.x=element_blank()) gg <- gg + theme(axis.text.x=element_blank()) gg <- gg + theme(axis.ticks.y=element_blank()) gg <- gg + theme(legend.position="bottom") gg <- gg + theme(plot.title=element_text(hjust=0)) gg |
Finally, the year-to-year nature of the data was just begging for a slopegraph:
question %>% mutate(vjust=0.5) -> question question[(question$belief=="Makes threat data more actionable") & (question$year=="2015"),]$vjust <- -1 question[(question$belief=="Reduces the cost of detecting andnpreventing cyber attacks") & (question$year=="2015"),]$vjust <- 1.5 question$year <- factor(question$year, levels=c("2013", "2014", "2015", "2016", "2017", "2018")) gg <- ggplot(question) gg <- gg + geom_path(aes(x=year, y=value, group=belief, color=line_col)) gg <- gg + geom_point(aes(x=year, y=value), shape=21, fill="black", color="white") gg <- gg + geom_text(data=filter(question, year=="2015"), aes(x=year, y=value, label=sprintf("u2000%s %s", percent(value), gsub("n", " ", belief)), vjust=vjust), hjust=0, size=3) gg <- gg + geom_text(data=filter(question, year=="2014"), aes(x=year, y=value, label=percent(value)), hjust=1.3, size=3) gg <- gg + scale_x_discrete(expand=c(0,0.1), drop=FALSE) gg <- gg + scale_color_tableau(name="") gg <- gg + labs(x=NULL, y=NULL, title="Fig 3: Reasons for fully participatingn") gg <- gg + theme_tufte(base_family="Arial Narrow") gg <- gg + theme(axis.ticks=element_blank()) gg <- gg + theme(axis.text=element_blank()) gg <- gg + theme(legend.position="none") gg <- gg + theme(plot.title=element_text(hjust=0.5)) gg <- gg + theme(plot.title=element_text(hjust=0)) gg |
It doesn’t help communicate uncertainty but it’s a nice alternative to bars.
Hopefully this helps provide some alternatives to bars for these types of comparisons and also ways to communicate uncertainty without confusing the reader (communicating uncertainty to a broad audience is hard).
Perhaps those conducting surveys (or data analyses in general) could subscribe to a “data visualizers” paraphrase of a quote from Epidemics, Book I, of the Hippocratic school:
Practice two things in your dealings with data: either help or do not harm the reader.
The full Rmd and data for this post is in this gist.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.