What is “Tidy Data”?
I would like to write a bit on the meaning and history of the phrase “tidy data.”
Hadley Wickham has been promoting the term “tidy data.” For example, in his eponymous paper he wrote:
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Wickham, Hadley “Tidy Data”, Journal of Statistical Software, Vol 59, 2014.
Let’s try to apply this definition to the following data set from Wikipedia:
Tournament | Year | Winner | Winner Date of Birth |
---|---|---|---|
Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975 |
Cleveland Open | 1999 | Bob Albertson | 28 September 1968 |
Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975 |
Indiana Invitational | 1999 | Chip Masterson | 14 March 1977 |
This would seem to be a nice “ready to analyze” data set. Rows are keyed by tournament and year, and rows carry additional key-derived observations of winner’s name and winner’s date of birth. From such a data set we could look for repeated winners, and look at the age of winners.
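As a minimal sketch of the two analyses just mentioned, the table can be entered as a base-R data frame. The column names with underscores and the `Approx_Age` column are illustrative assumptions, not part of the original table. Because the table records only the year of each tournament, the age is approximated as year of win minus year of birth.

```r
# Rebuild the example table as a base-R data frame.
# Column names are adapted for R; they are an assumption of this sketch.
d <- data.frame(
  Tournament = c("Indiana Invitational", "Cleveland Open",
                 "Des Moines Masters", "Indiana Invitational"),
  Year = c(1998, 1999, 1999, 1999),
  Winner = c("Al Fredrickson", "Bob Albertson",
             "Al Fredrickson", "Chip Masterson"),
  Winner_Date_of_Birth = as.Date(c("1975-07-21", "1968-09-28",
                                   "1975-07-21", "1977-03-14")),
  stringsAsFactors = FALSE
)

# Repeated winners: names that appear in more than one row.
wins <- table(d$Winner)
repeated_winners <- names(wins)[wins > 1]
print(repeated_winners)

# Approximate age at the win: year of win minus year of birth.
# (The table has no tournament date, so this is only accurate to a year.)
birth_year <- as.integer(format(d$Winner_Date_of_Birth, "%Y"))
d$Approx_Age <- d$Year - birth_year
print(d[, c("Winner", "Year", "Approx_Age")])
```

With the four rows above, the only repeated winner is Al Fredrickson.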
A question is: is such a data set “tidy”? The paper itself claims the above definitions are “Codd’s 3rd normal form.” So no, the above table is not “tidy” under that paper’s definition. The winner’s date of birth is a fact about the winner alone, and not a fact about the joint row keys (the tournament plus year), as required by the rules of Codd’s 3rd normal form. The critique is that this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.
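To make the normal-form critique concrete, here is a hedged sketch of how the same data could be split so that each fact depends only on its own key. The table names `tournament_winners` and `players` are hypothetical, chosen for illustration. This is one possible decomposition, not a procedure taken from the paper.

```r
# The example table, entered as a base-R data frame (names adapted for R).
d <- data.frame(
  Tournament = c("Indiana Invitational", "Cleveland Open",
                 "Des Moines Masters", "Indiana Invitational"),
  Year = c(1998, 1999, 1999, 1999),
  Winner = c("Al Fredrickson", "Bob Albertson",
             "Al Fredrickson", "Chip Masterson"),
  Winner_Date_of_Birth = as.Date(c("1975-07-21", "1968-09-28",
                                   "1975-07-21", "1977-03-14")),
  stringsAsFactors = FALSE
)

# Table 1: facts about the row key (Tournament, Year).
tournament_winners <- d[, c("Tournament", "Year", "Winner")]

# Table 2: facts about the winner alone, one row per distinct player.
players <- unique(d[, c("Winner", "Winner_Date_of_Birth")])

# The invariant "each winner has exactly one date of birth" is now a
# simple uniqueness check on the players table.
stopifnot(anyDuplicated(players$Winner) == 0)

# Joining the two tables rebuilds the "ready to analyze" (de-normalized) form.
rejoined <- merge(tournament_winners, players, by = "Winner")
print(rejoined)
```

The split form stores each date of birth once, and the joined form is the one that is convenient for analysis.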
Around January of 2017 Hadley Wickham apparently retconned the “tidy data” definition to be:
Tidy data is data where:
- Each variable is in a column.
- Each observation is a row.
- Each value is a cell.
Notice point-3 is now something possibly more related to Codd’s guaranteed access rule, and now the example table is plausibly “tidy.”
The above concept was already well known in statistics and called a “data matrix.” For example:
A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.
Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.
One must understand that in statistics, “individual” often refers to observations, not people.
The above reference clearly considers “data matrix” to be a noun phrase already in common use in statistics. It is in the book’s index, and often used in phrases such as:
Suppose X is an n × p data matrix …
Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 75.
So statistics not only already has the data organization concept, it also already has standard terminology for it. Data engineering often calls this data organization a “de-normalized form.”
As a further example, the statistical system R itself uses variations of the above standard terminology. Take for instance the help() text for R’s data.matrix() function:
data.matrix {base} R Documentation
Convert a Data Frame to a Numeric Matrix
Description
Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.
What is the extra “Factors and ordered factors are replaced by their internal codes” part going on about? That is also fairly standard; let’s expand the earlier data matrix quote a bit to see this.
A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual. When presenting data in this form, it is customary to assign a numerical code to the categories of a qualitative variable …
Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.
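The numerical coding of a qualitative variable can be seen directly in R. The following is a small sketch on made-up data; the data frame `df` and its columns `height` and `group` are hypothetical and serve only to show the factor codes.

```r
# A small made-up data frame with one numeric and one qualitative column.
df <- data.frame(
  height = c(1.5, 1.8, 1.6),
  group = factor(c("b", "a", "b"))
)

# data.matrix() converts every column to numeric and binds the columns
# into a matrix; the factor is replaced by its internal integer codes.
m <- data.matrix(df)
print(m)

# The codes follow the factor's level order, which defaults to sorted order.
print(levels(df$group))
```

Here the levels are "a" and "b", so the `group` column of the matrix holds the codes 2, 1, 2.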
Note: for many R analyses the model.matrix() command is implicitly called in preference to the data.matrix() command, as this conversion expands factors into “dummy variables,” a representation that is often more useful for modeling. The model.matrix() documentation starts as follows:
model.matrix {stats} R Documentation
Construct Design Matrices
Description
model.matrix creates a design (or model) matrix, e.g., by expanding factors to a set of dummy variables (depending on the contrasts) and expanding interactions similarly.
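For comparison with the integer codes of data.matrix(), here is a brief sketch of the dummy-variable expansion on made-up data. The data frame `df`, the formula `y ~ group`, and the reliance on R's default treatment contrasts are assumptions of this example.

```r
# Made-up data: a numeric outcome and a three-level factor.
df <- data.frame(
  y = c(1.0, 2.0, 3.0, 4.0),
  group = factor(c("a", "b", "c", "a"))
)

# model.matrix() builds the design matrix for the formula. Under R's
# default treatment contrasts, the first level ("a") is the reference
# level and each remaining level gets its own 0/1 indicator column.
X <- model.matrix(y ~ group, data = df)
print(X)
```

The result has an intercept column plus indicator columns for levels "b" and "c", rather than a single column of integer codes.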
So to summarize: the whole time we have been talking about well-understood concepts of organizing data for analysis that have a long history.
Frankly it appears “tidy data” is something akin to a trademark or marketing term, especially in its “tidyverse” variation.