[This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Overall I find the graphs produced from this code to be beautiful and fascinating though I am not sure if I would really use them as a form of data exploration. In addition, I am not sure if I would expect anybody to be able to understand what I am communicating. But let us first see some graphs associated with different distributions before we jump to any conclusions.
to distributions with sparse data at some areas) as well as the ability to heat map teh bivariate distribution based on a third variable z. Find the code on github.
Figure 1: Plotting data points. |
Figure 2: A standard 3D density graph created with the MASS package. |
Figure 3: For different ways of slicing the same data. |
Let us keep these previous graphs in mind as we explore Robert Grants waterfall graph. The bivariate distribution x and y plotted on the previous graphs can be seen in Figure 3 mapped from four different perspectives (the bottom two of the perspectives are reverse order so the bottom left hand panel is not the same). One immediately notices that while there are some similarities there are also a lot of differences.
Visibly the waterfall graph seems to have more information with many more lines as some capitalization on transparencies. The four 3D renderings of the data is much easier to identify as the same information from different angles. However, the waterfall graph seems to be less of a rotation or transposition of the same object. This puzzled me for a while until I realized that the waterfall graph is not communicating the same information at the 3D graph in its current form. Notice the top left graph. Notice how all of the peaks seem to be equivalent in the waterfall graph yet very pointed in the 3D rendering.
Figure 4: Distance between slices is proportional to number of observations. |
Figure 5: Transparency is proportional to sparcity |
Figure 6: Height is proportional to number of observations. |
Now that we have a few different ways of communicating the same information, let us see which set of graphs seems to get us to a place of understanding the relationship between our two variables of interest. First let us note how I generated the data (y~N(0,1) and x~3*N(0,1)+y^2). Thus x and y are uncorrelated but still dependent. All of the figures communicate some of this information. I might personally prefer Figure 1 because the figure sufficiently communicates a significant representation of the data. I think adding the 3d graphs does not add significantly to Figure 1. But this is because everything peaks smoothly in the dark area. If however there were say a dip in density within the dark region then Figure 1 would not be able to warn us of that dip unless it was severe.
Comparing Figure 1, Figure 2, and Figure 3, 4, 5, 6 I am not sure if I would favor the waterfall graphs (except 6 which hardly counts). While beautiful they do not communicate to me clearly what they are representing. However, perhaps this is just an issue of developing an internal intuition of what is going on. To that end, let us explore some less complex data relationships.
Figure 7: No relationship, positive correlation, negative correlation, linear dependency. |
Figure 7 (TL, TR, BR, RL) shows what no=correlation (x=N(0,1),y=N(0,1)), positive (x=N(0,1)+.5y), negative (x=N(0,1)-.5y), and perfect coliniearity (x=y). Does this help with out intuition? Maybe a little bit.
Figure 8: It is possible to include an additional variable z which is used to select color from multiple RGB specified colors. |
Overall, I would have to say that I am unconvinced that the waterfall graph is an effective substitute for a 3D graph. However, there is no reason to believe that this should be the only criteria for defining such a graph! So far we have been acting as if both x and y were random variables.
But often one of the variables in not random, or we are comfortable acting as if it not random when evaluating the other variable. In addition, we might not care how sparsely populated our slices are, we are mostly concerned with how our distributions change over time. Take income information and age. We might not care how many x year old people their are. But we may care at each age category how the distribution of income lays.
From this real world example, I therefore think that the waterfall slice graphing framework is very useful. it is not a replacement for the 3D graph but rather an alternative representation of a different feature of the data. If you would like to find the code to create the graphs used in this article please check out my github repo. And if you find this post helpful, please leave me some feedback!
IPUMS Citation:
Steven Ruggles, J. Trent Alexander, Katie Genadek, Ronald Goeken, Matthew B. Schroeder, and Matthew Sobek. Integrated Public Use Microdata Series: Version 5.0 [Machine-readable database]. Minneapolis, MN: Minnesota Population Center [producer and distributor], 2010.
To leave a comment for the author, please follow the link and comment on their blog: Econometrics by Simulation.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.