The Datasaurus Dozen
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
There's a reason why data scientists spend so much time exploring data using graphics. Relying only on data summaries like means, variances, and correlations can be dangerous, because wildly different data sets can give similar results. This is a principle that has been demonstrated in statistics classes for decades with Anscombe's Quartet: four scatterplots which despite being qualitatively different all have the same mean and variance and the same correlation between them.
(You can easily check this in R by loading the data with data(anscombe)
.) But what you might not realize is that it's possible to generate bivariate data with a given mean, median, and correlation in any shape you like — even a dinosaur:
The paper linked below describes a method of perturbing the points in a scatterplot, moving them towards a given shape while keeping the statistical summaries close to the fixed target value. The shapes include a star, and a cross, and the “DataSaurus” (first created by Alberto Cairo). The authors have published a dataset they call the “DataSaurus Dozen” (also available as an R package on GitHub, with thanks to Steph Locke) of the 12 scatterplots shown. Interestingly, even the transitional frames in the animations above maintain the same summary statistics to two decimal places. Python was used to generate the data sets (and the code should be available at the link below soon.)
Read the paper linked below for more details, and always remember: look at your data!
AutoDesk Research: Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.