Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Recap
In Part I, I showed that FN applied sectoral areas, rather than a pie chart or conventional histogram, to reduce the visual impact of highly variable zymotic disease data from the Crimean War. She wanted to demonstrate that diminishing disease was due mostly to her sanitation methodologies. The square-root attenuation of magnitudes, arising from the use of sectoral areas, helped her accomplish that objective. In addition, I showed that a plausibly simpler visualizaiton could have been had with a single 24-month cam diagram. See Fig. 2.Although R has some very sophisticated tools for producing almost any kind of data visualization—some of which I used in Part I—a very impressive interactive version of Fig. 1 has been created by the Statistical Laboratory at Cambridge University using Adobe Flash (may need browser plug-in). In addition to interactively watching the cams unfold at your own pace, you can also run it as an animation. You can even run it backwards! The tab labeled Superimposed shows the two original cam diagrams as an overlaid animation; an effect that FN needed to avoid in her static representation.
This kind of visualization requires a lot of work to construct and is perfect as an educational tool but complete overkill for typical data exploration. That said, their Bar Chart visualization is important for our discussion. It shows the original data presented as a classic histogram. Subsequently, I’ll use the terms histogram and bar chart synonymously.
Back to Bar Charts
Using R to create Fig. 3, I’ve reproduced a static version of the interactive Flash bar chart. Again, these data are the raw magnitudes (i.e., Mortality per 1000 soldiers) and not the rescaled data used in FN’s cam diagram (Fig. 1), but I’ll return to that detail in a minute.Social historian, Hugh Small, claims this histogram representation “didn’t convey the same message” as FN’s cam plots. In support of this position, he goes on to say
“My graphics tutor was a data expert but in management consulting one tends to use graphics to present messages, not data, and there is a rule number zero that ‘your graphic should support the message, the whole message, and nothing but the message’.”So, Small is arguing for Fig. 1 from a purely marketing standpoint, which is fine. However, that position is a bit puzzling because, if I understand his own historical investigations, FN’s cam diagrams had less than the intended marketing impact. Recall from Part I that the marketing message of the cam diagrams was bigger is badder, but pamphlet-wars broke out.
Moreover, I find the giant peak in the middle of Fig. 3 to be not only visually apparent but screaming out to be explained. Conversely, FN was trying to dampen the visual impact of that spike in the disease data, but I have to apply a lot of cognitive effort to understand all that based on Fig. 1. Without any quantitative frame of reference, it’s hard to know what the data is trying to tell us.
The main purpose of any visual message is to hook the attention your audience. FN’s cam diagram might offer a visual advantage over a conventional histogram, precisely because it is unconventional and therefore might be more likely to draw attention in the sense of causing the reader to ask, “WTF is that!?” For my part, I tend to react the same way when I see big spikes in histograms or any time series data. The big peak in Fig. 3 hooks my attention, even though I’ve already been exposed to more histograms than I care to think about. So, both representations (Figs. 1 and 3) can act as visual hooks, and FN’s cam diagram might have a slight edge, due to being unfamiliar but, I would suggest that’s about the extent of it.
The very next thing that goes through my mind is to start wondering about the shape of that peak. Fig. 1 does not provide any visual cue to take that mental step.
Is That Normal?
Recognizing that we are now moving well beyond a purely visual marketing message, the shape of the peak in Fig. 3 suggests that the data might be Normal-distributed. We can easily check for normality with a QQ-plot. Using R, the result is shown in Fig. 4.qqnorm(AnnZymPC/40,main="QQ for Zymotic data") abline(a=0,b=1)
It seems clear from this preliminary plot that the FN data do not conform to a Normal distribution. Given that negative result, the slight asymmetry in the histogram profile suggests that the data might be Gamma or Weibull distributed. Unfortunately, neither these nor any other similar distributions seem to apply.
Rescaling the Histogram
Let’s return to the main motivation for the cam plots (Fig. 1). FN wanted to reduce the visual impact of the zymotic disease data so as to counter any argument that the reduction in disease levels was due to things like seasonal effects rather than her sanitary methods. She achieved the desired result by employing equiangular sectors. As I discussed in Part I, that approach also has the side effect of producing a square-root scaling of the radial magnitudes (data height).The two cam plots in Fig. 1 represent before (right cam) and after (left cam) application of her sanitary methods on the battlefield. The histogram in Fig. 3 can be brought into correspondence with to the two cam diagrams by bisecting its peak vertically. The left side of the peak corresponds to the right cam plot and vice versa. Everything is back to front for the reasons I gave in Part I.
With this as background, it seems natural to ask, what would a square-root rescaling of the histogram in Fig. 3 look like? The result is shown in Fig. 5 (top right). It’s clear that the magnitudes are now more similar, which was FN’s purpose.
However, there are other ways to attenuate highly variable data. SPEC.org, for example, rescales highly variable CPU performance data by applying the geometric mean to benchmark measurements supplied by computer vendors. The geometric mean is simply the log-weighted arithmetic mean, and has the effect of making competitive data look more similar—a well-known marketing ploy.
Pursuing this idea, other log-weighted data transformations are shown in the bottom row of Fig. 5. Of these, the log-linear plot (lower left) seems most appropriate for reexamining FN’s visualization. Roughly stated, the before growth in zymotic disease is approximately exponential at one rate, whereas the decline in zymotic disease is also approximately exponential but at a different rate. Let’s see how that looks in R.
Exponential Models
On log-linear axes, an exponential function appears as a straight line with a slope corresponding to its rate parameter (e.g., growth rate, decay rate). See Fig. 6.The corresponding R code for the log-linear statistical regression is:
## Linear fit to log-linear data # Define the "before" and "after" time windows tw1 <- which(dates < "1855-01-01") tw2 < - which(dates >= "1855-01-01") fn1.fit <- lm(log(MthlyZym[tw1]) ~ tw1, data=fn[tw1,]) fn2.fit <- lm(log(MthlyZym[tw2]) ~ tw2, data=fn[tw2,]) T.double <- log(2)/coef(fn1.fit)["tw1"] T.half <- llog(1/2)/coef(fn2.fit)["tw2"] ############################################# ### Combined log-linear plots (Don't need to use log="y" again in points function) plot(dates, MthlyZym,type="h",xlab="",ylab="",lwd=6,col="blue",ylim=c(0.1,200),log="y") points(dates+5, MthlyWnd,type="h",lwd=2,col="red") points(dates+10, MthlyAll,type="h",lwd=2,col="black") lines(dates[tw1],exp(predict(fn1.fit)),lty="dashed",lwd=2,col="darkorange") lines(dates[tw2],exp(predict(fn2.fit)),lty="dashed",lwd=2,col="green3") text(x=dates[which(dates=="1854-08-01")],y=100, paste("Doubling time:\n", sprintf("%4.2f",T.double), "months")) text(x=dates[which(dates=="1855-09-01")],y=100, paste("Half life:\n", sprintf("%4.2f",T.half), "months")) title(main="Log-linear Fit of Preventable Zymotic Data")
Having determined the slopes from the log-linear plot, we can transform back to the original linear axes of Fig. 3 and superimpose the corresponding exponential curves in Fig. 7.
Here now, for the first time, we can assess the quantitative impact of FN's methods. Simply put, she managed to slow the progress of battlefield disease by about a factor of four from doubling every month to halving every 2 months.
As I alluded to in Part I, although it would have required a lot more tedious work, the same kind of quantitative analysis could have been performed in FN's day. Although statistics was still somewhat in its infancy, the great German mathematician, Carl Friedrich Gauss is credited with developing the underpinnings of least squares regression in 1795.
SIR Model
Finally, we can go one step further with our quantitative analysis. The existence of the two different exponential functions in Fig. 7 also tells us why the FN data cannot be accommodated by simple parametric distributions, like Gamma or Weibull. In some sense, the peak is too sharp. So what peak is it?It doesn't necessarily have to be any peak—in the sense of something that is immediately familiar or has a simple analytic description—but, the shape of the histogram in Fig. 7 is reminiscent of an epidemic profile. One of the earliest epidemic models, developed in the 1920s, is known as the Susceptible, Infected, Recovered or SIR model. My comparison of FN data with the SIR model can be seen in Fig. 8. Clearly, this is something that would not have been available to FN or her contemporaries, and is a testament to the power of modern personal computing.
The SIR model is rather complicated and has no analytic solutions. The differential equations of evolution have to be solved numerically. The interested reader can find more details and accompanying R code in SIR models of epidemics (PDF).
Figure 8 is only intended to give the flavor of what can be achieved with modern statistical tools like R. Suffice to say, it demonstrates the depth of the message that can be achieved with the right visual presentation of your data. We've gone from raw data, through various data visualizations, to a full-blown epidemic model.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.